# Introduction to Pandas and Data Manipulation

This notebook introduces pandas, a Python library for loading, cleaning, and analyzing tabular data. Each section demonstrates one group of operations, using the Titanic dataset as the primary example.

In [ ]:
# Import necessary libraries
import pandas as pd
import numpy as np

# Set pandas to display all columns
pd.set_option('display.max_columns', None)

## 1. Loading Data

Pandas reads different file formats through a family of `read_*` functions, each of which returns a DataFrame. The cell below shows `read_csv`, `read_json`, and `read_sql_query`.

In [ ]:
# Load CSV file
titanic_csv = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")
print("CSV data:")
print(titanic_csv.head())

# Load JSON file (illustrative only: "titanicdata.json" is a placeholder name,
# so these lines stay commented out until such a file exists locally)
# titanic_json = pd.read_json("titanicdata.json")
# print("\nJSON data:")
# print(titanic_json.head())

# Load data from a SQL database (this is just an example, not runnable without a database)
# import sqlite3
# conn = sqlite3.connect('titanic.db')
# titanic_sql = pd.read_sql_query("SELECT * FROM passengers", conn)
# print("\nSQL data:")
# print(titanic_sql.head())

# For this tutorial, we'll use the CSV data
titanic = titanic_csv
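
The loading functions accept options that control which columns are read and how missing values are recognized. The sketch below is a minimal illustration, not part of the Titanic workflow: `csv_text` is a made-up three-row sample, and the `"unknown"` marker is an assumption chosen for the example.

```python
import io

import pandas as pd

# Hypothetical sample that mimics a few Titanic-style columns.
csv_text = """PassengerId,Name,Age,Fare
1,Alice,22,7.25
2,Bob,,71.28
3,Carol,unknown,8.05
"""

sample = pd.read_csv(
    io.StringIO(csv_text),                   # any file path or URL works here too
    usecols=["PassengerId", "Age", "Fare"],  # read only these columns
    na_values=["unknown"],                   # treat this marker as missing
    index_col="PassengerId",                 # use the ID column as the row index
)
print(sample)
print(sample.dtypes)
```

Because the `"unknown"` entry is parsed as missing, `Age` can be read as a numeric column instead of text.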

## 2. Basic DataFrame Operations

Let's explore some basic operations we can perform on our DataFrame.

In [ ]:
# Display the first few rows
print(titanic.head())

# Display the last few rows
print(titanic.tail())

# Get basic information about the DataFrame
# (info() prints its report directly and returns None, so no print() is needed)
titanic.info()

# Get statistical summary of numerical columns
print(titanic.describe())

# Get column names
print(titanic.columns)

# Get the shape of the DataFrame (rows, columns)
print(titanic.shape)
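
For non-numeric columns, a frequency count is often more informative than `describe()`. The following sketch uses a small made-up DataFrame (`people` is a hypothetical stand-in for two Titanic columns) to show `dtypes`, `value_counts()`, and `nunique()`.

```python
import pandas as pd

# Hypothetical miniature of two Titanic columns, used only for illustration.
people = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female"],
    "Age": [22.0, 38.0, 26.0, None, 35.0],
})

# Data type of each column
print(people.dtypes)

# Frequency of each category in a column
sex_counts = people["Sex"].value_counts()
print(sex_counts)

# Number of distinct non-missing values per column
n_unique = people.nunique()
print(n_unique)
```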

## 3. Creating DataFrames

While we've loaded our main dataset from a file, let's see how we can create DataFrames from scratch.

In [ ]:
# Create a DataFrame from a list
list_df = pd.DataFrame([1, 2, 3, 4], columns=['Numbers'])
print("DataFrame from list:")
print(list_df)

# Create a DataFrame from a dictionary
dict_df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
print("\nDataFrame from dictionary:")
print(dict_df)

# Create a DataFrame with a custom index
indexed_df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]}, index=['P1', 'P2', 'P3'])
print("\nDataFrame with custom index:")
print(indexed_df)
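
Another common starting point is a list of records, where each dictionary describes one row. This is a small sketch with invented values; keys that are absent from a record become missing values in the resulting column.

```python
import pandas as pd

# Each dictionary is one row; the second record has an extra "City" key.
records = [
    {"Name": "Alice", "Age": 25},
    {"Name": "Bob", "Age": 30, "City": "Paris"},
]

records_df = pd.DataFrame(records)
print("DataFrame from a list of records:")
print(records_df)
```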

## 4. Selecting Data

Pandas offers several ways to select data from a DataFrame: by column name, by integer position with `iloc`, by label with `loc`, and by boolean condition.

In [ ]:
# Select a single column
print(titanic['Name'].head())

# Select multiple columns
print(titanic[['Name', 'Age', 'Sex']].head())

# Select rows by position using iloc
print(titanic.iloc[0:5, 0:3])  # First 5 rows, first 3 columns

# Select rows by label using loc (unlike iloc, loc slices include the end label)
print(titanic.loc[0:4, ['Name', 'Age', 'Sex']])  # Labels 0 through 4 (5 rows), specified columns

# Boolean indexing
print(titanic[titanic['Age'] > 70])  # Passengers over 70 years old
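
The difference between `loc` and `iloc` is easiest to see when the index does not start at zero. The sketch below uses a made-up four-row DataFrame (`df` and its values are illustrative assumptions) and also shows how to combine two boolean conditions.

```python
import pandas as pd

# Hypothetical data with an index that starts at 10 instead of 0.
df = pd.DataFrame(
    {"Age": [22, 71, 80, 35], "Sex": ["male", "female", "male", "female"]},
    index=[10, 11, 12, 13],
)

# loc slices by label and includes the end label
by_label = df.loc[10:12]

# iloc slices by position and excludes the end position
by_position = df.iloc[0:2]

# Combine conditions with & (and) or | (or); each condition needs parentheses
older_women = df[(df["Age"] > 70) & (df["Sex"] == "female")]

print(by_label)
print(by_position)
print(older_women)
```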

## 5. Adding and Removing Columns

In [ ]:
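# Add a new column by binning Age; each bin includes its right edge (e.g. 18 is 'Child'),
# and passengers with a missing Age get a missing Age_Group
titanic['Age_Group'] = pd.cut(
    titanic['Age'],
    bins=[0, 18, 35, 50, 100],
    labels=['Child', 'Young Adult', 'Adult', 'Senior'],
)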
print(titanic[['Name', 'Age', 'Age_Group']].head())

# Remove a column
titanic_no_cabin = titanic.drop('Cabin', axis=1)
print(titanic_no_cabin.columns)

# Rename columns
titanic_renamed = titanic.rename(columns={'Pclass': 'PassengerClass', 'SibSp': 'SiblingsSpouses'})
print(titanic_renamed.columns)
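
How `pd.cut` treats values that sit exactly on a bin edge is worth checking on a few hand-picked ages. This sketch reuses the bins and labels from the cell above on a made-up Series; by default each bin is open on the left and closed on the right, so a value equal to the lowest edge is left unassigned unless `include_lowest=True` is passed.

```python
import pandas as pd

bins = [0, 18, 35, 50, 100]
labels = ["Child", "Young Adult", "Adult", "Senior"]

# Hypothetical ages chosen to sit on the bin edges.
ages = pd.Series([0, 18, 19, 35, 100, None])

# Default behavior: intervals are (0, 18], (18, 35], (35, 50], (50, 100]
groups = pd.cut(ages, bins=bins, labels=labels)

# include_lowest=True also places the lowest edge (0) in the first bin
groups_incl = pd.cut(ages, bins=bins, labels=labels, include_lowest=True)

print(pd.DataFrame({"Age": ages, "default": groups, "include_lowest": groups_incl}))
```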

## 6. Handling Missing Data

In [ ]:
# Check for missing values
print(titanic.isnull().sum())

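# Fill missing Age with the column mean and missing Embarked with 'S';
# columns not listed in the dictionary (such as Cabin) are left unchanged
titanic_filled = titanic.fillna({'Age': titanic['Age'].mean(), 'Embarked': 'S'})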
print(titanic_filled.isnull().sum())

# Drop rows that have a missing value in any column
# (Cabin is missing for most passengers, so most rows are removed)
titanic_dropped = titanic.dropna()
print(f"Original shape: {titanic.shape}, Shape after dropping NA: {titanic_dropped.shape}")
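
Two refinements are often useful in practice: restricting `dropna` to the columns that matter, and filling a gap with a statistic from the row's own group rather than the whole column. The sketch below is one possible approach on a made-up DataFrame (the values are invented; the per-class median is an illustrative choice, not the method used above).

```python
import pandas as pd

# Hypothetical data with gaps in Age and Cabin.
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Age": [40.0, None, 30.0, 20.0, None, 10.0],
    "Cabin": ["C85", None, None, None, None, "E46"],
})

# Drop rows only when Age is missing; gaps in Cabin are ignored
with_age = df.dropna(subset=["Age"])

# Median age of each passenger class, aligned row by row with df
class_median = df.groupby("Pclass")["Age"].transform("median")

# Fill each missing Age with the median of that passenger's class
age_filled = df["Age"].fillna(class_median)

print(f"All rows: {len(df)}, complete rows: {len(df.dropna())}, rows with Age: {len(with_age)}")
print(age_filled)
```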

## 7. Grouping and Aggregation

In [ ]:
# Group by passenger class and calculate mean age and fare
class_stats = titanic.groupby('Pclass')[['Age', 'Fare']].mean()
print(class_stats)

# Count passengers by sex and survival status (0 = did not survive, 1 = survived)
survival_by_sex = titanic.groupby(['Sex', 'Survived'])['Survived'].count().unstack()
print(survival_by_sex)

# Multiple aggregations
age_stats = titanic.groupby('Pclass')['Age'].agg(['mean', 'min', 'max'])
print(age_stats)
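
When different columns need different statistics, named aggregation keeps the output column names readable. The sketch below runs on a made-up six-row DataFrame; because `Survived` is coded as 0 or 1, its mean is the survival rate.

```python
import pandas as pd

# Hypothetical data with Titanic-style columns.
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0],
    "Fare": [80.0, 60.0, 20.0, 10.0, 8.0, 7.0],
})

# Each keyword is an output column: name=(input column, aggregation function)
summary = df.groupby("Pclass").agg(
    survival_rate=("Survived", "mean"),
    mean_fare=("Fare", "mean"),
    passengers=("Survived", "size"),
)
print(summary)
```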

## 8. Merging and Joining DataFrames

In [ ]:
# Create two sample DataFrames
df1 = pd.DataFrame({'ID': [1, 2, 3, 4], 'Name': ['Alice', 'Bob', 'Charlie', 'David']})
df2 = pd.DataFrame({'ID': [1, 2, 3, 5], 'Age': [25, 30, 35, 40]})

# Inner join
inner_join = pd.merge(df1, df2, on='ID')
print("Inner Join:")
print(inner_join)

# Outer join
outer_join = pd.merge(df1, df2, on='ID', how='outer')
print("\nOuter Join:")
print(outer_join)

# Left join
left_join = pd.merge(df1, df2, on='ID', how='left')
print("\nLeft Join:")
print(left_join)
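
To see where each row of a join came from, `pd.merge` accepts `indicator=True`, and `validate` can check an assumption about key uniqueness. This sketch rebuilds the two sample DataFrames from the cell above so that it runs on its own.

```python
import pandas as pd

df1 = pd.DataFrame({"ID": [1, 2, 3, 4], "Name": ["Alice", "Bob", "Charlie", "David"]})
df2 = pd.DataFrame({"ID": [1, 2, 3, 5], "Age": [25, 30, 35, 40]})

merged = pd.merge(
    df1,
    df2,
    on="ID",
    how="outer",
    indicator=True,         # adds a "_merge" column: left_only, right_only, or both
    validate="one_to_one",  # raises an error if ID is duplicated on either side
)
print(merged)

# Rows that exist in only one of the two inputs
unmatched = merged[merged["_merge"] != "both"]
print(unmatched)
```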

## 9. Reshaping Data

In [ ]:
# Create a sample DataFrame
data = pd.DataFrame({
    'Date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'],
    'Product': ['A', 'B', 'A', 'B'],
    'Sales': [100, 150, 120, 180]
})

# Pivot the data
pivoted = data.pivot(index='Date', columns='Product', values='Sales')
print("Pivoted Data:")
print(pivoted)

# Melt the pivoted data back
melted = pivoted.reset_index().melt(id_vars=['Date'], var_name='Product', value_name='Sales')
print("\nMelted Data:")
print(melted)
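
`pivot` requires each index/column pair to appear only once. When the data contains repeats, `pivot_table` can aggregate them instead. The sketch below uses a made-up sales table with one repeated pair; summing the duplicates is an illustrative choice.

```python
import pandas as pd

# Hypothetical sales data: product A appears twice on 2023-01-01.
sales = pd.DataFrame({
    "Date": ["2023-01-01", "2023-01-01", "2023-01-01", "2023-01-02"],
    "Product": ["A", "A", "B", "A"],
    "Sales": [100, 50, 150, 120],
})

# pivot cannot decide which of the duplicate values to keep
try:
    sales.pivot(index="Date", columns="Product", values="Sales")
    pivot_failed = False
except ValueError as error:
    pivot_failed = True
    print(f"pivot failed: {error}")

# pivot_table aggregates duplicates and can fill empty cells
table = sales.pivot_table(
    index="Date", columns="Product", values="Sales", aggfunc="sum", fill_value=0
)
print(table)
```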

## 10. Data Visualization with Pandas

Pandas integrates well with matplotlib for basic plotting.

In [ ]:
import matplotlib.pyplot as plt

# Bar plot of survival count by passenger class
survival_by_class = titanic.groupby('Pclass')['Survived'].sum()
survival_by_class.plot(kind='bar')
plt.title('Survival Count by Passenger Class')
plt.xlabel('Passenger Class')
plt.ylabel('Number of Survivors')
plt.show()

# Histogram of passenger ages
titanic['Age'].hist(bins=20)
plt.title('Distribution of Passenger Ages')
plt.xlabel('Age')
plt.ylabel('Count')
plt.show()

# Scatter plot of age vs fare
plt.scatter(titanic['Age'], titanic['Fare'])
plt.title('Age vs Fare')
plt.xlabel('Age')
plt.ylabel('Fare')
plt.show()
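
Pandas plotting methods accept a matplotlib `Axes` through the `ax` parameter, which makes it possible to label and adjust the plot through that object. The sketch below plots a survival rate per class from a made-up DataFrame (the values are invented for illustration).

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data with Titanic-style columns.
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0],
})

# Mean of a 0/1 column is the share of survivors in each class
rate = df.groupby("Pclass")["Survived"].mean()

# Create the figure and axes explicitly, then let pandas draw onto them
fig, ax = plt.subplots(figsize=(5, 3))
rate.plot(kind="bar", ax=ax, rot=0)
ax.set_title("Survival Rate by Passenger Class")
ax.set_xlabel("Passenger Class")
ax.set_ylabel("Survival rate")
ax.set_ylim(0, 1)
fig.tight_layout()
# In a notebook the figure renders at the end of the cell;
# in a script, call plt.show() to display it.
```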

## Conclusion

This notebook has covered the basics of data manipulation with pandas, using the Titanic dataset. We loaded data, inspected and created DataFrames, selected and filtered rows and columns, added, removed, and renamed columns, handled missing values, grouped and aggregated, merged DataFrames, reshaped data, and drew basic plots. These skills form the foundation for more advanced data analysis and machine learning tasks.