# Pandas DataFrame Basics for Data Science

This notebook is a hands-on introduction to pandas DataFrames, the core tabular data structure for data analysis and manipulation in Python.

## Import Required Libraries

First, let's import pandas and other necessary libraries for working with DataFrames.

In [None]:
# Import the core libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Display settings for better visualization
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 50)
pd.set_option('display.width', 1000)
pd.set_option('display.expand_frame_repr', False)

# Check the versions
print(f"pandas version: {pd.__version__}")
print(f"numpy version: {np.__version__}")

## Creating DataFrames

There are several ways to create a pandas DataFrame. The cells below cover the most common ones:
1. From dictionaries
2. From lists
3. From CSV files
4. From Excel files
5. From NumPy arrays

In [None]:
# 1. Creating DataFrame from a dictionary
data_dict = {
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 34, 29, 42],
    'City': ['New York', 'Paris', 'Berlin', 'London'],
    'Salary': [65000, 70000, 62000, 85000]
}

df_dict = pd.DataFrame(data_dict)
print("DataFrame from dictionary:")
display(df_dict)

In [None]:
# 2. Creating DataFrame from lists
data_list = [
    ['John', 28, 'New York', 65000],
    ['Anna', 34, 'Paris', 70000],
    ['Peter', 29, 'Berlin', 62000],
    ['Linda', 42, 'London', 85000]
]

column_names = ['Name', 'Age', 'City', 'Salary']

df_list = pd.DataFrame(data_list, columns=column_names)
print("\nDataFrame from list:")
display(df_list)

In [None]:
# 3. Creating DataFrame from CSV (let's create a CSV file first)
df_dict.to_csv('sample_data.csv', index=False)

# Now read it back
df_csv = pd.read_csv('sample_data.csv')
print("\nDataFrame from CSV:")
display(df_csv)

In [None]:
# 4. Creating DataFrame from Excel (let's create an Excel file first)
# Note: reading and writing .xlsx files needs an Excel engine such as openpyxl to be installed
df_dict.to_excel('sample_data.xlsx', index=False)

# Now read it back
df_excel = pd.read_excel('sample_data.xlsx')
print("\nDataFrame from Excel:")
display(df_excel)

In [None]:
# 5. Creating DataFrame from NumPy array
numpy_array = np.random.randn(5, 4)  # Create a 5x4 array of random numbers
df_numpy = pd.DataFrame(numpy_array, columns=['A', 'B', 'C', 'D'])
print("\nDataFrame from NumPy array:")
display(df_numpy)

## Exploring DataFrame Structure

Once we have a DataFrame, we need to understand its structure and contents. Pandas provides several methods to explore DataFrames:

In [None]:
# Let's create a more substantial DataFrame to work with
df = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda', 'Bob', 'Sarah', 'Mike', 'Emma', 'David', 'Kate'],
    'Age': [28, 34, 29, 42, 35, 31, 40, 27, 38, 44],
    'City': ['New York', 'Paris', 'Berlin', 'London', 'Madrid', 'Rome', 'Tokyo', 'Sydney', 'Toronto', 'Moscow'],
    'Department': ['IT', 'HR', 'Sales', 'IT', 'Finance', 'HR', 'Sales', 'IT', 'Finance', 'Sales'],
    'Salary': [65000, 70000, 62000, 85000, 72000, 69000, 76000, 68000, 81000, 73000],
    'Experience': [3, 7, 4, 12, 8, 5, 10, 2, 9, 14],
    'Active': [True, True, False, True, True, False, True, True, False, True]
})

# Checking the shape (rows, columns)
print(f"DataFrame shape: {df.shape}")

# Get a quick overview of the DataFrame
print("\nDataFrame info:")
df.info()

# Display the first few rows
print("\nFirst 5 rows:")
display(df.head())

# Display the last few rows
print("\nLast 5 rows:")
display(df.tail())

# Display a random sample
print("\nRandom sample of 3 rows:")
display(df.sample(3))

# Get column names
print(f"\nColumn names: {df.columns.tolist()}")

# Get data types of each column
print("\nData types:")
display(df.dtypes)

# Get basic statistics for numeric columns
print("\nBasic statistics:")
display(df.describe())

# Include statistics for non-numeric columns as well
print("\nBasic statistics for all columns:")
display(df.describe(include='all'))

## Basic Data Selection and Filtering

Pandas offers multiple ways to select and filter data from DataFrames:
1. Column selection
2. Using `loc[]` for label-based indexing
3. Using `iloc[]` for integer-based indexing
4. Boolean indexing
5. The `query()` method
6. Working with multi-index DataFrames

In [None]:
# 1. Column selection
# Select a single column (returns a Series)
name_series = df['Name']
print("Single column selection (Series):")
display(name_series.head())

# Select multiple columns (returns a DataFrame)
subset = df[['Name', 'Age', 'Salary']]
print("\nMultiple column selection:")
display(subset.head())

In [None]:
# 2. Using loc[] for label-based indexing
# Select rows by index label and columns by name (the label slice 0:2 includes label 2)
print("Using loc[] to select specific rows and columns:")
display(df.loc[0:2, ['Name', 'Age', 'City']])

# 3. Using iloc[] for integer-based indexing
# Select rows and columns by position (the position slice 0:3 excludes position 3)
print("\nUsing iloc[] to select by position:")
display(df.iloc[0:3, 0:3])
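
The difference between label-based and position-based slicing is easier to see when the index is not the default 0, 1, 2, ... range. The following is a minimal sketch using a small made-up DataFrame (`s_df` and its index values are illustrative, not part of the notebook's data):

```python
import pandas as pd

# Hypothetical DataFrame with a non-default integer index
s_df = pd.DataFrame(
    {'Name': ['John', 'Anna', 'Peter', 'Linda'], 'Age': [28, 34, 29, 42]},
    index=[10, 20, 30, 40],
)

# loc slices by label, and the end label is included
by_label = s_df.loc[10:30]

# iloc slices by position, and the end position is excluded
by_position = s_df.iloc[0:2]

print(by_label)
print(by_position)
```

Here `loc[10:30]` returns three rows while `iloc[0:2]` returns two, which is why the two accessors should not be mixed up when the index holds integers.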

In [None]:
# 4. Boolean indexing
# Filter rows where Age > 35
older_employees = df[df['Age'] > 35]
print("Employees older than 35:")
display(older_employees)

# Filter with multiple conditions
it_dept_high_salary = df[(df['Department'] == 'IT') & (df['Salary'] > 70000)]
print("\nIT employees with salary > 70000:")
display(it_dept_high_salary)
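
Boolean masks can also be combined with `|` (or) and `~` (not), and `isin()` is a compact way to test membership in a list. Each comparison needs its own parentheses because `&`, `|` and `~` bind more tightly than `==` or `>` in Python. A small sketch on an illustrative DataFrame (`emp` is a made-up subset, not the notebook's `df`):

```python
import pandas as pd

# Illustrative data, assumed for this sketch only
emp = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Department': ['IT', 'HR', 'Sales', 'IT'],
    'Salary': [65000, 70000, 62000, 85000],
})

# Membership test: department is either IT or HR
it_or_hr = emp[emp['Department'].isin(['IT', 'HR'])]

# Negation: everyone who is not in IT
not_it = emp[~(emp['Department'] == 'IT')]

# OR condition: Sales employees or anyone earning more than 80000
sales_or_high = emp[(emp['Department'] == 'Sales') | (emp['Salary'] > 80000)]

print(it_or_hr)
print(not_it)
print(sales_or_high)
```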

In [None]:
# 5. Using query() method
# Same filter as above but using query()
query_result = df.query("Department == 'IT' and Salary > 70000")
print("Using query() method:")
display(query_result)
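
`query()` can also reference Python variables from the surrounding scope by prefixing them with `@`, which avoids building the expression with string formatting. A minimal sketch, where `emp` and `min_salary` are illustrative names chosen for this example:

```python
import pandas as pd

# Illustrative data, assumed for this sketch only
emp = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Department': ['IT', 'HR', 'Sales', 'IT'],
    'Salary': [65000, 70000, 62000, 85000],
})

# Threshold kept in a normal Python variable
min_salary = 68000

# '@min_salary' refers to the Python variable defined above
result = emp.query("Department == 'IT' and Salary > @min_salary")
print(result)
```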

In [None]:
# 6. Working with multi-index DataFrames
# Create a multi-index DataFrame
multi_index_df = df.set_index(['Department', 'City'])
print("Multi-index DataFrame:")
display(multi_index_df.head())

# Select data for a specific index level
it_dept = multi_index_df.loc['IT']
print("\nJust the IT Department:")
display(it_dept)

# Cross-section selection using xs()
new_york_employees = multi_index_df.xs('New York', level='City')
print("\nEmployees in New York across departments:")
display(new_york_employees)
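
Two further operations are often useful with a multi-index: selecting a single row with a tuple of labels, and turning the index levels back into ordinary columns with `reset_index()`. The sketch below uses a small made-up DataFrame (`emp`) and sorts the index first, which is commonly recommended for multi-index lookups:

```python
import pandas as pd

# Illustrative data, assumed for this sketch only
emp = pd.DataFrame({
    'Department': ['IT', 'HR', 'IT'],
    'City': ['New York', 'Paris', 'London'],
    'Salary': [65000, 70000, 85000],
})

# Build a two-level index and sort it
mi = emp.set_index(['Department', 'City']).sort_index()

# A tuple with one label per level selects a single row (returned as a Series)
row = mi.loc[('IT', 'London')]

# reset_index() moves the index levels back into regular columns
flat = mi.reset_index()

print(row)
print(flat)
```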

## Adding and Modifying Data

Let's explore how to add and modify data in pandas DataFrames:
1. Adding new columns
2. Modifying values
3. Renaming columns
4. Applying functions with `apply()` and `map()`

In [None]:
# Make a copy of the original DataFrame to work with
df_mod = df.copy()

# 1. Adding a new column
df_mod['Bonus'] = df_mod['Salary'] * 0.1
print("DataFrame with new Bonus column:")
display(df_mod.head())

# Add a column based on multiple other columns
df_mod['Total Compensation'] = df_mod['Salary'] + df_mod['Bonus']
display(df_mod[['Name', 'Salary', 'Bonus', 'Total Compensation']].head())

In [None]:
# 2. Modifying values
# Modify a single value
df_mod.loc[0, 'Bonus'] = 10000
print("After modifying a single value:")
display(df_mod.loc[0])

# Modify values based on a condition
df_mod.loc[df_mod['Experience'] > 10, 'Bonus'] = df_mod['Salary'] * 0.15
print("\nAfter increasing bonus for experienced employees:")
display(df_mod[df_mod['Experience'] > 10][['Name', 'Experience', 'Salary', 'Bonus']])
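
The conditional assignment above works because pandas aligns the right-hand Series with the selected rows by index label. One way to make that intent explicit is to apply the same mask on both sides. A minimal sketch on made-up data (`pay` and `senior` are illustrative names):

```python
import pandas as pd

# Illustrative data, assumed for this sketch only
pay = pd.DataFrame({
    'Salary': [65000, 85000, 73000],
    'Experience': [3, 12, 14],
})
pay['Bonus'] = pay['Salary'] * 0.10

# Boolean mask for employees with more than 10 years of experience
senior = pay['Experience'] > 10

# Use the same mask on both sides so only the matching rows are computed and assigned
pay.loc[senior, 'Bonus'] = pay.loc[senior, 'Salary'] * 0.15

print(pay)
```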

In [None]:
# 3. Renaming columns
df_mod = df_mod.rename(columns={
    'Salary': 'Base Salary',
    'Experience': 'Years of Experience'
})
print("After renaming columns:")
display(df_mod.head())

In [None]:
# 4. Applying functions with apply() and map()
# Using apply() row-wise (axis=1) to calculate salary per year of experience
df_mod['Salary per Year'] = df_mod.apply(
    lambda row: row['Base Salary'] / row['Years of Experience'] if row['Years of Experience'] > 0 else 0,
    axis=1
)

print("After adding calculated column with apply():")
display(df_mod[['Name', 'Base Salary', 'Years of Experience', 'Salary per Year']].head())

# Using map() to transform a column
department_tier = {
    'IT': 'Technical',
    'HR': 'Administrative',
    'Sales': 'Business',
    'Finance': 'Business'
}

df_mod['Department Type'] = df_mod['Department'].map(department_tier)
print("\nAfter adding Department Type with map():")
display(df_mod[['Name', 'Department', 'Department Type']].head())
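
One thing to watch with `map()` and a dictionary: any value that has no matching key becomes `NaN`. The sketch below shows this with a hypothetical `'Legal'` department that is absent from the mapping, and one possible way to supply a default afterwards:

```python
import pandas as pd

# Illustrative values; 'Legal' is deliberately missing from the mapping
departments = pd.Series(['IT', 'HR', 'Legal'])
tiers = {'IT': 'Technical', 'HR': 'Administrative'}

# Values without a key in the dictionary are mapped to NaN
mapped = departments.map(tiers)

# fillna() supplies a default label for the unmapped values
mapped_filled = departments.map(tiers).fillna('Other')

print(mapped)
print(mapped_filled)
```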

## Handling Missing Values

Missing values are a common issue in real-world datasets. Pandas provides several methods to handle them:
1. Detecting missing values
2. Removing missing values
3. Filling missing values

In [None]:
# Create a DataFrame with some missing values
df_missing = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [np.nan, 2, 3, 4, 5],
    'C': [1, 2, 3, np.nan, 5],
    'D': [np.nan, np.nan, 3, 4, 5]
})

print("DataFrame with missing values:")
display(df_missing)

In [None]:
# 1. Detecting missing values
print("Missing value check (True means missing):")
display(df_missing.isna())

# Count number of missing values in each column
print("\nNumber of missing values per column:")
display(df_missing.isna().sum())

# Count number of missing values in each row
print("\nNumber of missing values per row:")
display(df_missing.isna().sum(axis=1))

# Find rows with any missing values
print("\nRows with at least one missing value:")
display(df_missing[df_missing.isna().any(axis=1)])

# Find columns with any missing values
print("\nColumns with missing values:")
missing_cols = df_missing.columns[df_missing.isna().any()]
print(missing_cols.tolist())

In [None]:
# 2. Removing missing values
# Drop rows with any missing values
df_dropped = df_missing.dropna()
print("After dropping rows with missing values:")
display(df_dropped)

# Drop rows where all values are missing
df_dropped_all = df_missing.dropna(how='all')
print("\nAfter dropping rows where all values are missing:")
display(df_dropped_all)

# Drop columns with any missing values
df_dropped_cols = df_missing.dropna(axis=1)
print("\nAfter dropping columns with missing values:")
display(df_dropped_cols)
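
`dropna()` can also be restricted to particular columns with `subset`, or told to keep rows that have at least a minimum number of non-missing values with `thresh`. A minimal sketch on a small made-up DataFrame (`df_m` is an illustrative name):

```python
import numpy as np
import pandas as pd

# Illustrative data, assumed for this sketch only
df_m = pd.DataFrame({
    'A': [1, np.nan, 3],
    'B': [np.nan, 5, 6],
    'C': [7, 8, 9],
})

# Drop rows only when column 'A' is missing
subset_drop = df_m.dropna(subset=['A'])

# Keep rows that have at least 3 non-missing values
thresh_drop = df_m.dropna(thresh=3)

print(subset_drop)
print(thresh_drop)
```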

In [None]:
# 3. Filling missing values
# Fill with a specific value
df_filled = df_missing.fillna(0)
print("After filling missing values with 0:")
display(df_filled)

# Fill with column means
df_filled_mean = df_missing.fillna(df_missing.mean())
print("\nAfter filling missing values with column means:")
display(df_filled_mean)

# Fill with column medians
df_filled_median = df_missing.fillna(df_missing.median())
print("\nAfter filling missing values with column medians:")
display(df_filled_median)

# Fill forward (use previous value); fillna(method='ffill') is deprecated in recent pandas versions
df_ffill = df_missing.ffill()
print("\nAfter forward fill:")
display(df_ffill)

# Fill backward (use next value); fillna(method='bfill') is deprecated in recent pandas versions
df_bfill = df_missing.bfill()
print("\nAfter backward fill:")
display(df_bfill)

# Fill with different values for each column
fill_values = {'A': 0, 'B': -1, 'C': 999, 'D': -999}
df_fill_dict = df_missing.fillna(value=fill_values)
print("\nAfter filling with different values per column:")
display(df_fill_dict)
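
Forward fill has nothing to copy from when the missing values are at the start of a column, so those entries stay `NaN` (the same applies to backward fill at the end of a column). One possible workaround, sketched here on a made-up single-column DataFrame, is to chain the two:

```python
import numpy as np
import pandas as pd

# Illustrative column with missing values at both ends
gaps = pd.DataFrame({'D': [np.nan, np.nan, 3.0, 4.0, np.nan]})

# Forward fill alone leaves the two leading NaNs in place
forward = gaps.ffill()

# Forward fill followed by backward fill covers both ends
both = gaps.ffill().bfill()

print(forward)
print(both)
```

Whether chaining fills is appropriate depends on the data; for ordered measurements it can be reasonable, while for unrelated rows it may introduce misleading values.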

## Basic Statistical Operations

Pandas provides built-in methods for statistical operations on DataFrames:
1. Core statistical methods
2. Using `agg()` for multiple operations

In [None]:
# Let's use our original employee DataFrame
print("Original DataFrame (first 5 rows):")
display(df.head())

# 1. Core statistical methods
# Calculate the mean of numeric columns
print("\nMean values:")
display(df.mean(numeric_only=True))

# Calculate the median
print("\nMedian values:")
display(df.median(numeric_only=True))

# Calculate minimum and maximum
print("\nMinimum values:")
display(df.min(numeric_only=True))
print("\nMaximum values:")
display(df.max(numeric_only=True))

# Calculate standard deviation
print("\nStandard deviation:")
display(df.std(numeric_only=True))

# Calculate sum
print("\nSum of values:")
display(df.sum(numeric_only=True))

# Count non-null values
print("\nCount of non-null values:")
display(df.count())

In [None]:
# 2. Using agg() for multiple operations
# Apply multiple functions to the numeric columns only
# (string columns such as 'Name' cannot be averaged)
stats = df.select_dtypes(include='number').agg(['min', 'max', 'mean', 'median', 'std'])
print("Multiple statistics with agg():")
display(stats)

# Apply different functions to different columns
custom_agg = df.agg({
    'Age': ['min', 'max', 'mean'],
    'Salary': ['mean', 'median', 'std'],
    'Experience': ['min', 'max', 'mean']
})

print("\nCustom aggregations by column:")
display(custom_agg)

In [None]:
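# Calculate correlation between numeric columns
# (numeric_only=True skips string columns, which would otherwise raise an error in pandas 2.x)
corr = df.corr(numeric_only=True)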
print("Correlation matrix:")
display(corr)

# Visualize the correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Matrix')
plt.show()

## GroupBy Operations

The `groupby()` method is a powerful tool for grouped data analysis:
1. Basic groupby operations
2. Aggregation with groupby
3. Transformations with groupby
4. Multiple grouping levels and group filtering

In [None]:
# 1. Basic groupby operations
# Group by a single column
dept_groups = df.groupby('Department')

# Get the size of each group
print("Number of employees in each department:")
display(dept_groups.size())

# Get basic statistics for each group
print("\nBasic statistics for each department:")
display(dept_groups.describe())

# Access a specific group
print("\nIT Department details:")
display(dept_groups.get_group('IT'))

In [None]:
# 2. Aggregation with groupby
# Calculate mean values for each department
dept_means = dept_groups.mean(numeric_only=True)
print("Mean values for each department:")
display(dept_means)

# Apply multiple aggregation functions
dept_aggs = dept_groups.agg({
    'Age': 'mean',
    'Salary': ['mean', 'min', 'max', 'std'],
    'Experience': ['mean', 'min', 'max']
})

print("\nMultiple aggregations by department:")
display(dept_aggs)
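
Passing a dictionary of lists to `agg()` produces two-level column labels such as `('Salary', 'mean')`. If flat, descriptive column names are preferred, named aggregation is an alternative. The sketch below uses a small made-up DataFrame, and the output names (`avg_salary`, `max_salary`, `avg_age`) are arbitrary choices for this example:

```python
import pandas as pd

# Illustrative data, assumed for this sketch only
emp = pd.DataFrame({
    'Department': ['IT', 'HR', 'IT', 'HR'],
    'Salary': [65000, 70000, 85000, 69000],
    'Age': [28, 34, 42, 31],
})

# Named aggregation: new_column_name=(source_column, function)
summary = emp.groupby('Department').agg(
    avg_salary=('Salary', 'mean'),
    max_salary=('Salary', 'max'),
    avg_age=('Age', 'mean'),
)

print(summary)
```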

In [None]:
# 3. Transformations with groupby
# Add a column showing the mean salary for each department
df['Dept Average Salary'] = df.groupby('Department')['Salary'].transform('mean')

# Calculate the difference between individual salary and department average
df['Salary vs Dept Avg'] = df['Salary'] - df['Dept Average Salary']

print("DataFrame with transformed columns:")
display(df[['Name', 'Department', 'Salary', 'Dept Average Salary', 'Salary vs Dept Avg']])
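
The key difference between aggregating and transforming is the shape of the result: an aggregation returns one value per group, while `transform()` returns one value per original row, aligned to the original index. That alignment is what makes it possible to assign the result straight back as a new column. A minimal sketch on made-up data:

```python
import pandas as pd

# Illustrative data, assumed for this sketch only
emp = pd.DataFrame({
    'Department': ['IT', 'HR', 'IT', 'HR'],
    'Salary': [65000, 70000, 85000, 69000],
})

# Aggregation: one row per department
agg_result = emp.groupby('Department')['Salary'].mean()

# Transformation: one row per employee, carrying the department mean
transform_result = emp.groupby('Department')['Salary'].transform('mean')

print(agg_result)
print(transform_result)
```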

In [None]:
# 4. Multiple grouping levels
# Group by Department and City
multi_group = df.groupby(['Department', 'City'])

print("Number of employees by Department and City:")
display(multi_group.size())

print("\nAverage age and salary by Department and City:")
display(multi_group[['Age', 'Salary']].mean())

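# Keep only groups that have more than 1 employee
# (every Department/City pair is unique in this data, so the result is empty)
print("\nGroups with more than 1 employee (empty for this data):")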
group_filter = multi_group.filter(lambda x: len(x) > 1)
display(group_filter)
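
To see `filter()` return rows, the grouping needs at least one group with more than one member. The following sketch groups a small made-up DataFrame by department only:

```python
import pandas as pd

# Illustrative data, assumed for this sketch only
emp = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda', 'Bob'],
    'Department': ['IT', 'HR', 'Sales', 'IT', 'Finance'],
})

# Keep the rows of every department that has more than one employee
multi_member = emp.groupby('Department').filter(lambda g: len(g) > 1)

print(multi_member)
```

The result keeps the original rows and index labels of the qualifying groups rather than returning one row per group.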

In [None]:
# Visualize the average salary by department
avg_salary_by_dept = df.groupby('Department')['Salary'].mean().sort_values(ascending=False)

plt.figure(figsize=(10, 6))
avg_salary_by_dept.plot(kind='bar', color='skyblue')
plt.title('Average Salary by Department')
plt.xlabel('Department')
plt.ylabel('Average Salary ($)')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.xticks(rotation=0)

for i, v in enumerate(avg_salary_by_dept):
    plt.text(i, v + 1000, f"${v:,.0f}", ha='center')

plt.tight_layout()
plt.show()

## Conclusion

This notebook covered the essential operations for working with pandas DataFrames:

1. Creating DataFrames from various data sources
2. Exploring and understanding DataFrame structure
3. Selecting and filtering data using different methods
4. Adding and modifying DataFrame content
5. Handling missing values effectively
6. Performing statistical operations
7. Using groupby for aggregated analysis

These fundamentals provide a strong foundation for more advanced data analysis and manipulation tasks in Python.