
# 🌿 Pandas Mastery Series - DataFrame

Welcome to the Pandas Mastery Series! In this notebook, we dive into the core structure of pandas: the DataFrame. We cover creating, inspecting, indexing, modifying, merging, and aggregating DataFrames, with detailed explanations, practical examples, and fun challenges along the way. Let's get started on our journey to mastering DataFrames!

## Table of Contents

1. **Introduction to DataFrames**
    - What is a DataFrame?
    - Key Features of DataFrames


2. **DataFrame Creation**
    - From Dictionary
    - From List of Lists
    - From List of Dictionaries
    - From Another DataFrame


3. **DataFrame Inspection**
    - DataFrame.info()
    - DataFrame.shape
    - DataFrame.axes
    - DataFrame.size


4. **To And From CSV**
    - DataFrame.to_csv()
    - pandas.read_csv()


5. **Basic Indexing**
    - Indexing Columns
    - Indexing Rows
    - Boolean Indexing


6. **Basic Operations**
    - Inserting Columns
    - Updating Values
    - Removing Columns
    - Changing Column Names


7. **DataFrame.apply()**
    - apply() with Function Arguments


8. **Merging DataFrames**
    - Inner Join
    - Left Join
    - Right Join
    - Outer Join
    - Anti-Join Methods


9. **Aggregation**


10. **Group By**
    - GroupBy Aggregate
    - Renaming Output Columns
    - GroupBy Transform


11. **Fun Challenges**
    - Challenge 1: The Mischievous Data Entry
    - Challenge 2: The Lost Column
    - Challenge 3: The Data Detective
    - Challenge 4: The Aggregation Adventure
    - Challenge 5: The Plotting Puzzle


### Ready for the Ultimate Challenge?

Once you've completed all the notebooks in the Pandas Mastery Series, you'll be ready to tackle the final challenge: [Pandas Mastery Series - Ultimate Challenge](https://www.kaggle.com/code/matinmahmoudi/pandas-mastery-series-ultimate-challenge). This ultimate challenge will put your pandas skills to the test and ensure you're truly a pandas master.

Let's get started and become pandas DataFrame masters!


# 1. Introduction to DataFrames

A DataFrame is a two-dimensional, size-mutable, and heterogeneous tabular data structure with labeled axes (rows and columns) in pandas. It is similar to a spreadsheet or SQL table, and it is one of the most commonly used objects for data manipulation and analysis in pandas.

### What is a DataFrame?

A DataFrame consists of rows and columns:
- **Rows**: Represent individual records or observations.
- **Columns**: Represent different variables or features of the data.

### Key Features of DataFrames
- **Heterogeneous data**: Different columns can hold different data types (integers, floats, strings, etc.); each column has a single dtype.
- **Labeled axes**: Both rows and columns have labels (the index and the column names).
- **Size-mutable**: Rows and columns can be added or removed after creation.
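
The three features above can be checked directly on a small DataFrame. The sketch below is only an illustration: the `demo` name and the `Ring_bearer` column are made up for this example.

```python
import pandas as pd

# Hypothetical demo data, chosen only to illustrate the three features
demo = pd.DataFrame({
    'Name': ['Frodo', 'Gandalf'],      # strings
    'Age': [50, 2019],                 # integers
    'Ring_bearer': [True, False],      # Booleans
})

# Heterogeneous data: each column carries its own dtype
print(demo.dtypes)

# Labeled axes: row labels live in .index, column labels in .columns
print(demo.index.tolist(), demo.columns.tolist())

# Size-mutable: add a column, then append a row with pd.concat
demo['Race'] = ['Hobbit', 'Wizard']
new_row = pd.DataFrame(
    [{'Name': 'Sam', 'Age': 38, 'Ring_bearer': False, 'Race': 'Hobbit'}]
)
demo = pd.concat([demo, new_row], ignore_index=True)
print(demo.shape)   # (3, 4)
```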


In [1]:
# Import pandas library
import pandas as pd

# Creating a DataFrame from a dictionary
# Define a dictionary with keys as column names and values as lists of column data
data_dict = {
    'Name': ['Frodo', 'Sam', 'Gandalf', 'Aragorn', 'Legolas'],
    'Race': ['Hobbit', 'Hobbit', 'Wizard', 'Human', 'Elf'],
    'Age': [50, 38, 2019, 87, 2931],
    'Role': ['Ring-bearer', 'Gardener', 'Wizard', 'King', 'Archer']
}
# Create a DataFrame using the dictionary
df_dict = pd.DataFrame(data_dict)
# Print the DataFrame
print("DataFrame created from a dictionary:\n", df_dict)


DataFrame created from a dictionary:
       Name    Race   Age         Role
0    Frodo  Hobbit    50  Ring-bearer
1      Sam  Hobbit    38     Gardener
2  Gandalf  Wizard  2019       Wizard
3  Aragorn   Human    87         King
4  Legolas     Elf  2931       Archer


# 2. DataFrame Creation

Creating DataFrames is fundamental in pandas. Let's explore different ways to create DataFrames using characters from "The Lord of the Rings" for a fun twist!

### Creating a DataFrame from a Dictionary
A dictionary can be used to create a DataFrame where keys are column names and values are lists of column data.

### Creating a DataFrame from a List of Lists
Lists of lists can represent rows of data. We can create a DataFrame by specifying the column names.

### Creating a DataFrame from a List of Dictionaries
Each dictionary in a list can represent a row of data, making it easy to convert into a DataFrame.
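
One detail worth knowing with this approach is what happens when the dictionaries do not share the same keys. The following minimal sketch uses made-up records (the `records` and `df_records` names are illustrative) to show that pandas takes the union of the keys and fills the gaps with NaN.

```python
import pandas as pd

# Hypothetical records: the second dictionary has no 'Role' key
records = [
    {'Name': 'Frodo', 'Race': 'Hobbit', 'Role': 'Ring-bearer'},
    {'Name': 'Gimli', 'Race': 'Dwarf'},
]
df_records = pd.DataFrame(records)

# Columns are the union of all keys; missing entries become NaN
print(df_records)
print(df_records['Role'].isna().tolist())   # [False, True]
```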

### Creating a DataFrame from Another DataFrame
We can also create a DataFrame by copying an existing one with `copy()`. By default the copy is deep, so later changes to the copy do not affect the original.
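
To see why copying matters, it helps to compare `copy()` with plain assignment, which only creates a second name for the same object. This is a small sketch with illustrative variable names (`original`, `alias`, `independent`).

```python
import pandas as pd

original = pd.DataFrame({'Name': ['Frodo', 'Sam'], 'Age': [50, 38]})

alias = original               # same object, no data is copied
independent = original.copy()  # separate DataFrame (deep copy by default)

# Change the copy only
independent.loc[0, 'Age'] = 51

print(alias is original)            # True
print(original.loc[0, 'Age'])       # 50
print(independent.loc[0, 'Age'])    # 51
```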


In [2]:
# Import pandas library
import pandas as pd

# Creating a DataFrame from a dictionary
# Define a dictionary with keys as column names and values as lists of column data
data_dict = {
    'Name': ['Frodo', 'Sam', 'Gandalf', 'Aragorn', 'Legolas'],
    'Race': ['Hobbit', 'Hobbit', 'Wizard', 'Human', 'Elf'],
    'Age': [50, 38, 2019, 87, 2931],
    'Role': ['Ring-bearer', 'Gardener', 'Wizard', 'King', 'Archer']
}
# Create a DataFrame using the dictionary
df_dict = pd.DataFrame(data_dict)
# Print the DataFrame
print("DataFrame created from a dictionary:\n", df_dict)

# Creating a DataFrame from a list of lists
# Define a list of lists where each inner list represents a row of data
data_list = [
    ['Frodo', 'Hobbit', 50, 'Ring-bearer'],
    ['Sam', 'Hobbit', 38, 'Gardener'],
    ['Gandalf', 'Wizard', 2019, 'Wizard'],
    ['Aragorn', 'Human', 87, 'King'],
    ['Legolas', 'Elf', 2931, 'Archer']
]
# Create a DataFrame using the list of lists and specify column names
df_list = pd.DataFrame(data_list, columns=['Name', 'Race', 'Age', 'Role'])
# Print the DataFrame
print("\nDataFrame created from a list of lists:\n", df_list)

# Creating a DataFrame from a list of dictionaries
# Define a list where each element is a dictionary representing a row
data_list_dicts = [
    {'Name': 'Frodo', 'Race': 'Hobbit', 'Age': 50, 'Role': 'Ring-bearer'},
    {'Name': 'Sam', 'Race': 'Hobbit', 'Age': 38, 'Role': 'Gardener'},
    {'Name': 'Gandalf', 'Race': 'Wizard', 'Age': 2019, 'Role': 'Wizard'},
    {'Name': 'Aragorn', 'Race': 'Human', 'Age': 87, 'Role': 'King'},
    {'Name': 'Legolas', 'Race': 'Elf', 'Age': 2931, 'Role': 'Archer'}
]
# Create a DataFrame using the list of dictionaries
df_list_dicts = pd.DataFrame(data_list_dicts)
# Print the DataFrame
print("\nDataFrame created from a list of dictionaries:\n", df_list_dicts)

# Creating a DataFrame from another DataFrame (copying)
# Create a copy of the existing DataFrame
df_copy = df_dict.copy()
# Print the DataFrame
print("\nDataFrame created by copying another DataFrame:\n", df_copy)


DataFrame created from a dictionary:
       Name    Race   Age         Role
0    Frodo  Hobbit    50  Ring-bearer
1      Sam  Hobbit    38     Gardener
2  Gandalf  Wizard  2019       Wizard
3  Aragorn   Human    87         King
4  Legolas     Elf  2931       Archer

DataFrame created from a list of lists:
       Name    Race   Age         Role
0    Frodo  Hobbit    50  Ring-bearer
1      Sam  Hobbit    38     Gardener
2  Gandalf  Wizard  2019       Wizard
3  Aragorn   Human    87         King
4  Legolas     Elf  2931       Archer

DataFrame created from a list of dictionaries:
       Name    Race   Age         Role
0    Frodo  Hobbit    50  Ring-bearer
1      Sam  Hobbit    38     Gardener
2  Gandalf  Wizard  2019       Wizard
3  Aragorn   Human    87         King
4  Legolas     Elf  2931       Archer

DataFrame created by copying another DataFrame:
       Name    Race   Age         Role
0    Frodo  Hobbit    50  Ring-bearer
1      Sam  Hobbit    38     Gardener
2  Gandalf  Wizard  2019       Wizard
3  Aragorn   Human    87         King
4  Legolas     Elf  2931       Archer

# 3. DataFrame Inspection

Inspecting a DataFrame is crucial for understanding the structure and content of your data. pandas provides several methods and attributes for this.

### DataFrame.info()
The `info()` method provides a concise summary of a DataFrame, including the index dtype, column dtypes, non-null values, and memory usage.

### DataFrame.shape
The `shape` attribute returns a tuple representing the dimensionality of the DataFrame (rows, columns).

### DataFrame.axes
The `axes` attribute returns a list representing the row and column axis labels.

### DataFrame.size
The `size` attribute returns the number of elements in the DataFrame.
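
These attributes are closely related, which the short sketch below tries to make visible: `size` is the product of the two numbers in `shape`, and `axes` holds the row labels followed by the column labels. The `df_inspect` name is just for illustration.

```python
import pandas as pd

df_inspect = pd.DataFrame({
    'Name': ['Frodo', 'Sam', 'Gandalf'],
    'Age': [50, 38, 2019],
})

# shape is (rows, columns); axes is [row labels, column labels]
n_rows, n_cols = df_inspect.shape
row_labels, col_labels = df_inspect.axes

print(n_rows, n_cols)                        # 3 2
print(df_inspect.size == n_rows * n_cols)    # True
print(len(df_inspect) == n_rows)             # True
print(list(col_labels))                      # ['Name', 'Age']
```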


In [3]:
# Import pandas library
import pandas as pd

# Creating a DataFrame from a dictionary for inspection
data_dict = {
    'Name': ['Frodo', 'Sam', 'Gandalf', 'Aragorn', 'Legolas'],
    'Race': ['Hobbit', 'Hobbit', 'Wizard', 'Human', 'Elf'],
    'Age': [50, 38, 2019, 87, 2931],
    'Role': ['Ring-bearer', 'Gardener', 'Wizard', 'King', 'Archer']
}
df = pd.DataFrame(data_dict)

# Using DataFrame.info() to get a concise summary of the DataFrame
print("DataFrame.info() output:")
df.info()

# Using DataFrame.shape to get the dimensionality of the DataFrame
print("\nDataFrame.shape output:", df.shape)

# Using DataFrame.axes to get the row and column axis labels
print("\nDataFrame.axes output:", df.axes)

# Using DataFrame.size to get the number of elements in the DataFrame
print("\nDataFrame.size output:", df.size)


DataFrame.info() output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    5 non-null      object
 1   Race    5 non-null      object
 2   Age     5 non-null      int64 
 3   Role    5 non-null      object
dtypes: int64(1), object(3)
memory usage: 288.0+ bytes

DataFrame.shape output: (5, 4)

DataFrame.axes output: [RangeIndex(start=0, stop=5, step=1), Index(['Name', 'Race', 'Age', 'Role'], dtype='object')]

DataFrame.size output: 20


# 4. To And From CSV

CSV (Comma-Separated Values) files are a common format for storing tabular data. pandas provides easy-to-use functions for reading from and writing to CSV files.

### DataFrame.to_csv()
The `to_csv()` method allows you to export a DataFrame to a CSV file. You can specify various options like the file path, delimiter, and whether to include the index.

### pandas.read_csv()
The `read_csv()` function is used to read a CSV file into a DataFrame. It offers numerous parameters for handling different CSV formats and data types.
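
The delimiter and index options mentioned above can be tried without touching the disk. The sketch below is one way to do it, assuming an in-memory buffer (`io.StringIO`) stands in for a file; the variable names are illustrative.

```python
import io
import pandas as pd

df_csv = pd.DataFrame({'Name': ['Frodo', 'Sam'], 'Age': [50, 38]})

# With no path, to_csv() returns the CSV text as a string
text = df_csv.to_csv(sep=';', index=False)
print(text)

# read_csv() accepts a file-like object; the separator must match
round_trip = pd.read_csv(io.StringIO(text), sep=';')
print(round_trip)

# Writing the index (the default) adds an extra leading column;
# index_col=0 turns it back into the row index when reading
with_index = df_csv.to_csv()
restored = pd.read_csv(io.StringIO(with_index), index_col=0)
print(restored)
```

Passing `index=False`, as the notebook does, avoids the extra column entirely when the index carries no information.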


In [4]:
# Import pandas library
import pandas as pd

# Creating a DataFrame from a dictionary to demonstrate CSV operations
data_dict = {
    'Name': ['Frodo', 'Sam', 'Gandalf', 'Aragorn', 'Legolas'],
    'Race': ['Hobbit', 'Hobbit', 'Wizard', 'Human', 'Elf'],
    'Age': [50, 38, 2019, 87, 2931],
    'Role': ['Ring-bearer', 'Gardener', 'Wizard', 'King', 'Archer']
}
df = pd.DataFrame(data_dict)

# Exporting the DataFrame to a CSV file
# Using the to_csv() method to write the DataFrame to a CSV file named 'lotr_characters.csv'
df.to_csv('lotr_characters.csv', index=False)
print("DataFrame exported to 'lotr_characters.csv'.")

# Reading the DataFrame back from the CSV file
# Using the read_csv() function to read the CSV file into a new DataFrame
df_from_csv = pd.read_csv('lotr_characters.csv')
print("\nDataFrame read from 'lotr_characters.csv':\n", df_from_csv)


DataFrame exported to 'lotr_characters.csv'.

DataFrame read from 'lotr_characters.csv':
       Name    Race   Age         Role
0    Frodo  Hobbit    50  Ring-bearer
1      Sam  Hobbit    38     Gardener
2  Gandalf  Wizard  2019       Wizard
3  Aragorn   Human    87         King
4  Legolas     Elf  2931       Archer


# 5. Basic Indexing

Indexing is essential for accessing and modifying data within a DataFrame. pandas provides multiple ways to index DataFrames: by column, by row, and by Boolean condition.

### Indexing Columns
You can select columns of a DataFrame using the column name as a key.
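
The type of the result depends on what you pass as the key. As a small illustration (with made-up variable names), a single column name gives a Series, while a list of names gives a DataFrame:

```python
import pandas as pd

df_cols = pd.DataFrame({
    'Name': ['Frodo', 'Sam'],
    'Race': ['Hobbit', 'Hobbit'],
    'Age': [50, 38],
})

single = df_cols['Name']             # one key -> Series
subset = df_cols[['Name', 'Age']]    # list of keys -> DataFrame

print(type(single).__name__)     # Series
print(type(subset).__name__)     # DataFrame
print(subset.columns.tolist())   # ['Name', 'Age']
```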

### Indexing Rows
Rows can be selected by integer position with the `iloc` indexer, which uses square brackets rather than a method call.
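
Position-based selection is easiest to understand next to its label-based counterpart, `loc`. The sketch below assumes a DataFrame whose row labels are character names rather than the default 0, 1, 2; the `df_rows` name is illustrative.

```python
import pandas as pd

df_rows = pd.DataFrame(
    {'Race': ['Hobbit', 'Wizard', 'Elf'], 'Age': [50, 2019, 2931]},
    index=['Frodo', 'Gandalf', 'Legolas'],   # row labels instead of 0, 1, 2
)

# iloc selects by integer position; the end of a slice is excluded
print(df_rows.iloc[0:2])

# loc selects by label; the end of a slice is included
print(df_rows.loc['Frodo':'Gandalf'])

# Both accept a (row, column) pair
print(df_rows.iloc[2, 1])              # 2931
print(df_rows.loc['Legolas', 'Age'])   # 2931
```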

### Boolean Indexing
Boolean indexing allows you to filter data based on conditions.
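
A condition on a column produces a Boolean Series, and several conditions can be combined. The following sketch reuses the character data from this notebook (without the `Role` column) under the illustrative name `df_bool`:

```python
import pandas as pd

df_bool = pd.DataFrame({
    'Name': ['Frodo', 'Sam', 'Gandalf', 'Aragorn', 'Legolas'],
    'Race': ['Hobbit', 'Hobbit', 'Wizard', 'Human', 'Elf'],
    'Age': [50, 38, 2019, 87, 2931],
})

# A comparison produces a Boolean Series, one value per row
mask = df_bool['Age'] > 100
print(mask.tolist())   # [False, False, True, False, True]

# Combine conditions with & (and), | (or), ~ (not); wrap each in parentheses
young_hobbits = df_bool[(df_bool['Race'] == 'Hobbit') & (df_bool['Age'] < 45)]
print(young_hobbits['Name'].tolist())   # ['Sam']

# isin() tests membership in a list of values
elves_or_wizards = df_bool[df_bool['Race'].isin(['Elf', 'Wizard'])]
print(elves_or_wizards['Name'].tolist())   # ['Gandalf', 'Legolas']
```

The parentheses are needed because `&` and `|` bind more tightly than comparison operators in Python.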


In [5]:
# Import pandas library
import pandas as pd

# Creating a DataFrame from a dictionary for indexing examples
data_dict = {
    'Name': ['Frodo', 'Sam', 'Gandalf', 'Aragorn', 'Legolas'],
    'Race': ['Hobbit', 'Hobbit', 'Wizard', 'Human', 'Elf'],
    'Age': [50, 38, 2019, 87, 2931],
    'Role': ['Ring-bearer', 'Gardener', 'Wizard', 'King', 'Archer']
}
df = pd.DataFrame(data_dict)

# Indexing Columns
# Select the 'Name' column
names = df['Name']
print("Column 'Name':\n", names)

# Indexing Rows using iloc
# Select the first row
first_row = df.iloc[0]
print("\nFirst row using iloc:\n", first_row)

# Boolean Indexing
# Select rows where Age is greater than 100
age_filter = df[df['Age'] > 100]
print("\nRows where Age is greater than 100:\n", age_filter)


Column 'Name':
 0      Frodo
1        Sam
2    Gandalf
3    Aragorn
4    Legolas
Name: Name, dtype: object

First row using iloc:
 Name          Frodo
Race         Hobbit
Age              50
Role    Ring-bearer
Name: 0, dtype: object

Rows where Age is greater than 100:
       Name    Race   Age    Role
2  Gandalf  Wizard  2019  Wizard
4  Legolas     Elf  2931  Archer


# 6. Basic Operations

Performing basic operations on DataFrames is crucial for data manipulation. pandas provides various methods to insert, update, remove, and rename columns.

### Inserting Columns
You can add new columns to a DataFrame by assigning values to a new column name.

### Updating Values
Values can be updated by assigning through an indexer such as `loc`, for example by pairing a Boolean row condition with a column label.

### Removing Columns
Columns can be removed using the `drop()` method.

### Changing Column Names
Column names can be changed using the `rename()` method or by directly modifying the `columns` attribute.
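
It is worth noting which of these operations change the DataFrame in place and which return a new one. The sketch below is a small illustration using `insert()`, which is not shown elsewhere in this notebook, alongside `rename()` and `drop()`; the `df_ops` name is made up.

```python
import pandas as pd

df_ops = pd.DataFrame({'Name': ['Frodo', 'Sam'], 'Age': [50, 38]})

# insert() places a new column at a given position and modifies df_ops in place
df_ops.insert(1, 'Race', ['Hobbit', 'Hobbit'])
print(df_ops.columns.tolist())     # ['Name', 'Race', 'Age']

# rename() and drop() return new DataFrames and leave df_ops unchanged
renamed = df_ops.rename(columns={'Age': 'Years'})
trimmed = df_ops.drop(columns=['Race'])
print(renamed.columns.tolist())    # ['Name', 'Race', 'Years']
print(trimmed.columns.tolist())    # ['Name', 'Age']
print(df_ops.columns.tolist())     # ['Name', 'Race', 'Age']
```

This is why the notebook reassigns the result, as in `df = df.drop(columns=['Home'])`.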


In [6]:
# Import pandas library
import pandas as pd

# Creating a DataFrame from a dictionary for basic operations
data_dict = {
    'Name': ['Frodo', 'Sam', 'Gandalf', 'Aragorn', 'Legolas'],
    'Race': ['Hobbit', 'Hobbit', 'Wizard', 'Human', 'Elf'],
    'Age': [50, 38, 2019, 87, 2931],
    'Role': ['Ring-bearer', 'Gardener', 'Wizard', 'King', 'Archer']
}
df = pd.DataFrame(data_dict)

# Inserting Columns
# Add a new column 'Home' with one value per row
df['Home'] = ['Shire', 'Shire', 'Middle-earth', 'Gondor', 'Mirkwood']
print("DataFrame after inserting 'Home' column:\n", df)

# Updating Values
# Update the 'Age' of 'Frodo' to 51
df.loc[df['Name'] == 'Frodo', 'Age'] = 51
print("\nDataFrame after updating 'Frodo''s Age:\n", df)

# Removing Columns
# Remove the 'Home' column
df = df.drop(columns=['Home'])
print("\nDataFrame after removing 'Home' column:\n", df)

# Changing Column Names
# Rename the 'Role' column to 'Occupation'
df = df.rename(columns={'Role': 'Occupation'})
print("\nDataFrame after renaming 'Role' to 'Occupation':\n", df)

# Alternatively, modify the columns attribute directly
df.columns = ['Character Name', 'Species', 'Years', 'Occupation']
print("\nDataFrame after modifying column names directly:\n", df)


DataFrame after inserting 'Home' column:
       Name    Race   Age         Role          Home
0    Frodo  Hobbit    50  Ring-bearer         Shire
1      Sam  Hobbit    38     Gardener         Shire
2  Gandalf  Wizard  2019       Wizard  Middle-earth
3  Aragorn   Human    87         King        Gondor
4  Legolas     Elf  2931       Archer      Mirkwood

DataFrame after updating 'Frodo''s Age:
       Name    Race   Age         Role          Home
0    Frodo  Hobbit    51  Ring-bearer         Shire
1      Sam  Hobbit    38     Gardener         Shire
2  Gandalf  Wizard  2019       Wizard  Middle-earth
3  Aragorn   Human    87         King        Gondor
4  Legolas     Elf  2931       Archer      Mirkwood

DataFrame after removing 'Home' column:
       Name    Race   Age         Role
0    Frodo  Hobbit    51  Ring-bearer
1      Sam  Hobbit    38     Gardener
2  Gandalf  Wizard  2019       Wizard
3  Aragorn   Human    87         King
4  Legolas     Elf  2931       Archer

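DataFrame after renaming 'Role' to 'Occupation':
       Name    Race   Age   Occupation
0    Frodo  Hobbit    51  Ring-bearer
1      Sam  Hobbit    38     Gardener
2  Gandalf  Wizard  2019       Wizard
3  Aragorn   Human    87         King
4  Legolas     Elf  2931       Archer

DataFrame after modifying column names directly:
    Character Name  Species  Years   Occupation
0           Frodo   Hobbit     51  Ring-bearer
1             Sam   Hobbit     38     Gardener
2         Gandalf   Wizard   2019       Wizard
3         Aragorn    Human     87         King
4         Legolas      Elf   2931       Archer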

# 7. DataFrame.apply()

The `apply()` method applies a function along an axis of the DataFrame: to each column with `axis=0` (the default) or to each row with `axis=1`. It is a flexible tool for data transformation.

### apply() with Function Arguments
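`DataFrame.apply()` passes each row or column to your function as a Series, which suits operations that need several values at once rather than a single element. Any extra keyword arguments given to `apply()` are forwarded to the function, so one function can be reused with different settings.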
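
Besides keyword arguments, extra positional arguments can be supplied through the `args` parameter. The sketch below illustrates both styles; the helper functions `flag_old` and `describe` and their parameters are hypothetical, invented for this example.

```python
import pandas as pd

df_apply = pd.DataFrame({'Name': ['Frodo', 'Gandalf'], 'Age': [50, 2019]})

# Hypothetical helper: 'threshold' and 'label' are the extra arguments
def flag_old(age, threshold, label='old'):
    return label if age >= threshold else 'not ' + label

# Series.apply: each element is the first argument; args= supplies extra
# positional arguments, and keyword arguments are forwarded by name
df_apply['Flag'] = df_apply['Age'].apply(flag_old, args=(100,), label='ancient')
print(df_apply['Flag'].tolist())      # ['not ancient', 'ancient']

# DataFrame.apply with axis=1: each row arrives as a Series
def describe(row, sep):
    return f"{row['Name']}{sep}{row['Age']}"

df_apply['Summary'] = df_apply.apply(describe, axis=1, args=(': ',))
print(df_apply['Summary'].tolist())   # ['Frodo: 50', 'Gandalf: 2019']
```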


In [7]:
# Import pandas library
import pandas as pd

# Creating a DataFrame for apply() examples
data_dict = {
    'Name': ['Frodo', 'Sam', 'Gandalf', 'Aragorn', 'Legolas'],
    'Race': ['Hobbit', 'Hobbit', 'Wizard', 'Human', 'Elf'],
    'Age': [50, 38, 2019, 87, 2931],
    'Role': ['Ring-bearer', 'Gardener', 'Wizard', 'King', 'Archer']
}
df = pd.DataFrame(data_dict)

# Define a function to classify characters based on age
def age_classification(age):
    if age < 100:
        return 'Young'
    elif 100 <= age < 1000:
        return 'Middle-aged'
    else:
        return 'Ancient'

# Apply the function to the 'Age' column
df['Age Group'] = df['Age'].apply(age_classification)
print("DataFrame after applying age_classification to 'Age' column:\n", df)

# Define a function that takes additional arguments
def custom_greeting(row, prefix='Hello', suffix='!'):
    return f"{prefix} {row['Name']} the {row['Race']}{suffix}"

# Apply the function to each row with additional arguments
df['Greeting'] = df.apply(custom_greeting, axis=1, prefix='Greetings', suffix='!!!')
print("\nDataFrame after applying custom_greeting to each row:\n", df)


DataFrame after applying age_classification to 'Age' column:
       Name    Race   Age         Role Age Group
0    Frodo  Hobbit    50  Ring-bearer     Young
1      Sam  Hobbit    38     Gardener     Young
2  Gandalf  Wizard  2019       Wizard   Ancient
3  Aragorn   Human    87         King     Young
4  Legolas     Elf  2931       Archer   Ancient

DataFrame after applying custom_greeting to each row:
       Name    Race   Age         Role Age Group  \
0    Frodo  Hobbit    50  Ring-bearer     Young   
1      Sam  Hobbit    38     Gardener     Young   
2  Gandalf  Wizard  2019       Wizard   Ancient   
3  Aragorn   Human    87         King     Young   
4  Legolas     Elf  2931       Archer   Ancient   

                          Greeting  
0    Greetings Frodo the Hobbit!!!  
1      Greetings Sam the Hobbit!!!  
2  Greetings Gandalf the Wizard!!!  
3   Greetings Aragorn the Human!!!  
4     Greetings Legolas the Elf!!!  


# 8. Merging DataFrames

Merging DataFrames is a common task in data analysis: it combines DataFrames based on common columns or indices. `pd.merge()` supports inner, left, right, and outer joins through its `how` parameter.

### Inner Join
An inner join returns only the rows that have matching values in both DataFrames.

### Left Join
A left join returns all rows from the left DataFrame and the matching rows from the right DataFrame. Left rows with no match in the right DataFrame get NaN in the columns that come from the right.

### Right Join
A right join returns all rows from the right DataFrame and the matching rows from the left DataFrame. Right rows with no match in the left DataFrame get NaN in the columns that come from the left.

### Outer Join
An outer join returns all rows from both DataFrames, matching them where possible. Rows with no match on the other side get NaN in the columns that come from that side.

### Anti-Join Methods
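An anti-join keeps the rows of one DataFrame that have no match in the other: a left anti-join returns left rows whose key is absent from the right DataFrame, and a right anti-join does the reverse.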
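
One possible way to build anti-joins, alongside the `isin()` approach used in this notebook, is the `indicator` parameter of `pd.merge()`, which records where each row came from. The sketch below uses a trimmed-down version of the notebook's weapon and companion data; the variable names are illustrative.

```python
import pandas as pd

left = pd.DataFrame({'Name': ['Frodo', 'Aragorn'], 'Weapon': ['Sting', 'Anduril']})
right = pd.DataFrame({'Name': ['Frodo', 'Gimli'], 'Companion': ['Sam', 'Legolas']})

# indicator=True adds a '_merge' column: 'left_only', 'right_only', or 'both'
merged = pd.merge(left, right, on='Name', how='outer', indicator=True)
print(merged[['Name', '_merge']])

# Left anti-join: keep rows found only in the left DataFrame
left_only = merged[merged['_merge'] == 'left_only'].drop(columns=['Companion', '_merge'])
print(left_only['Name'].tolist())    # ['Aragorn']

# Right anti-join: keep rows found only in the right DataFrame
right_only = merged[merged['_merge'] == 'right_only'].drop(columns=['Weapon', '_merge'])
print(right_only['Name'].tolist())   # ['Gimli']
```

The `isin()` version is shorter for a single key column; the indicator version extends more naturally to merges on several columns.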


In [8]:
# Import pandas library
import pandas as pd

# Creating DataFrames for merging examples
data_left = {
    'Name': ['Frodo', 'Sam', 'Gandalf', 'Aragorn', 'Legolas'],
    'Weapon': ['Sting', 'Sword', 'Staff', 'Anduril', 'Bow']
}
data_right = {
    'Name': ['Frodo', 'Sam', 'Gandalf', 'Gimli', 'Legolas'],
    'Companion': ['Sam', 'Frodo', 'Aragorn', 'Legolas', 'Gimli']
}

df_left = pd.DataFrame(data_left)
df_right = pd.DataFrame(data_right)

# Inner Join
inner_join = pd.merge(df_left, df_right, on='Name', how='inner')
print("Inner Join:\n", inner_join)

# Left Join
left_join = pd.merge(df_left, df_right, on='Name', how='left')
print("\nLeft Join:\n", left_join)

# Right Join
right_join = pd.merge(df_left, df_right, on='Name', how='right')
print("\nRight Join:\n", right_join)

# Outer Join
outer_join = pd.merge(df_left, df_right, on='Name', how='outer')
print("\nOuter Join:\n", outer_join)

# Anti-Join Methods
# Left Anti-Join: rows in df_left not in df_right
left_anti_join = df_left[~df_left['Name'].isin(df_right['Name'])]
print("\nLeft Anti-Join:\n", left_anti_join)

# Right Anti-Join: rows in df_right not in df_left
right_anti_join = df_right[~df_right['Name'].isin(df_left['Name'])]
print("\nRight Anti-Join:\n", right_anti_join)


Inner Join:
       Name Weapon Companion
0    Frodo  Sting       Sam
1      Sam  Sword     Frodo
2  Gandalf  Staff   Aragorn
3  Legolas    Bow     Gimli

Left Join:
       Name   Weapon Companion
0    Frodo    Sting       Sam
1      Sam    Sword     Frodo
2  Gandalf    Staff   Aragorn
3  Aragorn  Anduril       NaN
4  Legolas      Bow     Gimli

Right Join:
       Name Weapon Companion
0    Frodo  Sting       Sam
1      Sam  Sword     Frodo
2  Gandalf  Staff   Aragorn
3    Gimli    NaN   Legolas
4  Legolas    Bow     Gimli

Outer Join:
       Name   Weapon Companion
0  Aragorn  Anduril       NaN
1    Frodo    Sting       Sam
2  Gandalf    Staff   Aragorn
3    Gimli      NaN   Legolas
4  Legolas      Bow     Gimli
5      Sam    Sword     Frodo

Left Anti-Join:
       Name   Weapon
3  Aragorn  Anduril

Right Anti-Join:
     Name Companion
3  Gimli   Legolas
