# Pandas for Data Manipulation and Analysis

This notebook introduces Pandas, a Python library for data manipulation and analysis. It covers data inspection, selection, cleaning, aggregation, merging, visualization, and exporting.

## 1. Importing Pandas

First, let's import Pandas to start working with it.

In [None]:
import pandas as pd

## 2. Creating DataFrames

The DataFrame is the main data structure in Pandas: a two-dimensional table with labeled rows and columns. Let's create a DataFrame from a dictionary.

In [None]:
# Creating a DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [25, 30, 35, 40, 45],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix']
}
df = pd.DataFrame(data)
df
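
A DataFrame can also be built from a list of row dictionaries, which is common when records arrive one at a time. The following is a minimal sketch; the names `records` and `df_records` are illustrative and do not appear elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical records; each dictionary becomes one row
records = [
    {'Name': 'Alice', 'Age': 25},
    {'Name': 'Bob', 'Age': 30},
    {'Name': 'Charlie'},  # no 'Age' key, so this cell becomes NaN
]
df_records = pd.DataFrame(records)
print(df_records)
```

Because one record has no `Age`, Pandas fills that cell with a missing value instead of raising an error.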

## 3. Inspecting DataFrames

You can inspect DataFrames using various methods to understand the data better.

In [None]:
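# Display the first few rows of the DataFrame
# (print is needed because a cell only shows its last expression)
print(df.head())

# Display a summary of the DataFrame: column dtypes and non-null counts
# (info() prints directly and returns None)
df.info()

# Display summary statistics for the numeric columns
df.describe()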
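
A few attributes complement the methods above when you only need the dimensions, column names, or column types. This sketch rebuilds the example DataFrame so that it runs on its own.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [25, 30, 35, 40, 45],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix']
})

# (number of rows, number of columns)
print(df.shape)

# Column names as a plain Python list
print(df.columns.tolist())

# Data type of each column
print(df.dtypes)
```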

## 4. Data Selection and Filtering

Let's learn how to select specific rows and columns from a DataFrame.

In [None]:
# Selecting a single column (returns a Series)
print(df['Name'])

# Selecting multiple columns (returns a DataFrame)
print(df[['Name', 'City']])

# Filtering rows based on a condition
df[df['Age'] > 30]
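
Selection can also combine several conditions, or pick rows and columns in one step with `.loc` (by label) and `.iloc` (by position). The sketch below rebuilds the example DataFrame so that it runs on its own; the result variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [25, 30, 35, 40, 45],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix']
})

# Combine conditions with & (and) or | (or); each condition needs parentheses
over_30_not_houston = df[(df['Age'] > 30) & (df['City'] != 'Houston')]
print(over_30_not_houston)

# .loc selects by label: rows matching a boolean mask, plus named columns
names_over_30 = df.loc[df['Age'] > 30, ['Name', 'Age']]
print(names_over_30)

# .iloc selects by integer position: first two rows, first two columns
first_two = df.iloc[:2, :2]
print(first_two)
```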

## 5. Data Cleaning

Data cleaning involves handling missing values, removing duplicates, and correcting data types.
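
The subsections that follow cover missing values and duplicates. For correcting data types, a minimal sketch might look like the following. It assumes a hypothetical DataFrame, `df_types`, in which ages were read in as strings.

```python
import pandas as pd

# Hypothetical data where numbers arrived as strings
df_types = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': ['25', '30', 'unknown']
})

# errors='coerce' turns values that cannot be parsed into NaN
df_types['Age'] = pd.to_numeric(df_types['Age'], errors='coerce')
print(df_types)
print(df_types.dtypes)
```

Coercing unparseable values to NaN keeps the conversion from failing and lets the missing-value tools handle the leftovers.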

### 5.1. Handling Missing Values
You can handle missing values by either dropping them or filling them with a specific value.

In [None]:
# Creating a DataFrame with missing values
df_with_nan = pd.DataFrame({
    'Name': ['Alice', 'Bob', None, 'David'],
    'Age': [25, None, 35, 40],
    'City': ['New York', 'Los Angeles', 'Chicago', None]
})
print("DataFrame with NaN values:")
print(df_with_nan)

# Detecting missing values
print("\nMissing values in DataFrame:")
print(df_with_nan.isna())

# Dropping rows with missing values
df_dropped = df_with_nan.dropna()
print("\nDataFrame after dropping missing values:")
print(df_dropped)

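# Filling missing values per column: the mean for 'Age', a placeholder for 'City'
# ('Name' is not listed in the dictionary, so its missing value is left as-is)
df_filled = df_with_nan.fillna({'Age': df_with_nan['Age'].mean(), 'City': 'Unknown'})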
print("\nDataFrame after filling missing values:")
print(df_filled)
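
Before choosing between dropping and filling, it often helps to count the missing values in each column. The sketch below rebuilds the example DataFrame so that it runs on its own; `missing_per_column` and `df_named` are illustrative names.

```python
import pandas as pd

df_with_nan = pd.DataFrame({
    'Name': ['Alice', 'Bob', None, 'David'],
    'Age': [25, None, 35, 40],
    'City': ['New York', 'Los Angeles', 'Chicago', None]
})

# isna() gives True/False per cell; sum() counts the True values per column
missing_per_column = df_with_nan.isna().sum()
print(missing_per_column)

# Drop rows only when 'Name' is missing, keeping rows with gaps elsewhere
df_named = df_with_nan.dropna(subset=['Name'])
print(df_named)
```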

### 5.2. Removing Duplicates
You can remove duplicate rows using the `drop_duplicates()` method.

In [None]:
# Creating a DataFrame with duplicate rows
df_with_duplicates = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'Charlie'],
    'Age': [25, 30, 25, 35],
    'City': ['New York', 'Los Angeles', 'New York', 'Chicago']
})
print("DataFrame with duplicates:")
print(df_with_duplicates)

# Removing duplicate rows
df_no_duplicates = df_with_duplicates.drop_duplicates()
print("\nDataFrame after removing duplicates:")
print(df_no_duplicates)
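
By default, `drop_duplicates()` treats rows as duplicates only when every column matches. The `subset` and `keep` parameters adjust this. The following sketch uses a hypothetical DataFrame, `df_people`, in which the same name appears with different details.

```python
import pandas as pd

# Hypothetical data: 'Alice' appears twice with different Age and City
df_people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice'],
    'Age': [25, 30, 26],
    'City': ['New York', 'Los Angeles', 'Boston']
})

# Compare rows on 'Name' only, and keep the last occurrence of each name
df_unique_names = df_people.drop_duplicates(subset=['Name'], keep='last')
print(df_unique_names)
```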

## 6. Data Aggregation

Data aggregation involves summarizing and grouping data based on specific criteria. Pandas provides powerful methods for aggregation and grouping.

### 6.1. Grouping Data
You can group data using the `groupby()` method and then perform aggregate functions like sum, mean, and count.

In [None]:
# Creating a DataFrame for grouping
df_grouping = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob'],
    'Age': [25, 30, 25, 35, 30],
    'City': ['New York', 'Los Angeles', 'New York', 'Chicago', 'Los Angeles']
})
print("DataFrame for grouping:")
print(df_grouping)

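# Grouping by 'Name' and calculating mean age
# Select 'Age' explicitly: 'City' is non-numeric, and pandas 2.0+ raises
# a TypeError when asked to average it
grouped = df_grouping.groupby('Name')['Age'].mean()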
print("\nGrouped Data (mean age by Name):")
print(grouped)

### 6.2. Aggregation Functions
Grouped data supports single aggregation methods such as `sum()`, `mean()`, and `count()`. To apply several functions in one call, use `agg()`.

In [None]:
# Aggregating data with multiple functions
aggregation = df_grouping.groupby('City').agg({
    'Age': ['mean', 'sum', 'count']
})
print("\nAggregated Data (mean, sum, count of Age by City):")
print(aggregation)
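
Passing a dictionary of lists to `agg()` produces two-level column names. When flat, descriptive column names are preferable, named aggregation is one alternative. In this sketch the output column names `mean_age` and `people` are illustrative choices.

```python
import pandas as pd

df_grouping = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob'],
    'Age': [25, 30, 25, 35, 30],
    'City': ['New York', 'Los Angeles', 'New York', 'Chicago', 'Los Angeles']
})

# Each keyword is an output column: (source column, aggregation function)
summary = df_grouping.groupby('City').agg(
    mean_age=('Age', 'mean'),
    people=('Name', 'count'),
)
print(summary)
```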

## 7. Merging and Joining DataFrames

Merging and joining DataFrames are essential operations for combining datasets based on common columns or indices.

### 7.1. Merging DataFrames
You can merge DataFrames using the `merge()` method, similar to SQL joins.

In [None]:
# Creating DataFrames for merging
df1 = pd.DataFrame({
    'Key': ['A', 'B', 'C', 'D'],
    'Value1': [1, 2, 3, 4]
})
df2 = pd.DataFrame({
    'Key': ['A', 'B', 'E'],
    'Value2': [5, 6, 7]
})

# Merging DataFrames on 'Key'
merged = pd.merge(df1, df2, on='Key', how='inner')
print("Merged DataFrame (inner join on 'Key'):")
print(merged)
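
An inner join keeps only the keys found in both DataFrames. To keep every key and see where each row came from, one option is an outer join with `indicator=True`, which adds a `_merge` column. The sketch below rebuilds the two DataFrames so that it runs on its own.

```python
import pandas as pd

df1 = pd.DataFrame({
    'Key': ['A', 'B', 'C', 'D'],
    'Value1': [1, 2, 3, 4]
})
df2 = pd.DataFrame({
    'Key': ['A', 'B', 'E'],
    'Value2': [5, 6, 7]
})

# how='outer' keeps keys from both sides; unmatched cells become NaN
# indicator=True adds '_merge': 'both', 'left_only', or 'right_only'
merged_outer = pd.merge(df1, df2, on='Key', how='outer', indicator=True)
print(merged_outer)
```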

### 7.2. Joining DataFrames
The `join()` method is a convenient way to combine DataFrames on their index.

In [None]:
# Creating DataFrames for joining
df1 = pd.DataFrame({
    'Value1': [1, 2, 3]
}, index=['A', 'B', 'C'])
df2 = pd.DataFrame({
    'Value2': [4, 5, 6]
}, index=['A', 'B', 'D'])

# Joining DataFrames on index
joined = df1.join(df2, how='outer')
print("Joined DataFrame (outer join on index):")
print(joined)
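
The same result can be obtained with `merge()` by telling it to match on the index of both DataFrames. This sketch rebuilds the two DataFrames and compares the two approaches; `merged_on_index` is an illustrative name.

```python
import pandas as pd

df1 = pd.DataFrame({
    'Value1': [1, 2, 3]
}, index=['A', 'B', 'C'])
df2 = pd.DataFrame({
    'Value2': [4, 5, 6]
}, index=['A', 'B', 'D'])

# join() matches on the index by default
joined = df1.join(df2, how='outer')

# merge() does the same when told to use both indexes as keys
merged_on_index = pd.merge(df1, df2, left_index=True, right_index=True, how='outer')

print(joined)
print(merged_on_index)
```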

## 8. Data Visualization

Pandas integrates with Matplotlib to provide convenient plotting capabilities directly from DataFrames.

In [None]:
# Importing Matplotlib for plotting
import matplotlib.pyplot as plt

# Creating a simple line plot
df_line = pd.DataFrame({
    'Year': [2010, 2011, 2012, 2013, 2014],
    'Value': [100, 200, 150, 300, 250]
})
df_line.plot(x='Year', y='Value', kind='line')
plt.title('Year vs Value')
plt.xlabel('Year')
plt.ylabel('Value')
plt.show()

### 8.1. Bar Plot
You can create a bar plot using the `plot()` method with the `kind='bar'` argument.

In [None]:
# Creating a bar plot
df_bar = pd.DataFrame({
    'Category': ['A', 'B', 'C', 'D'],
    'Values': [23, 45, 56, 78]
})
df_bar.plot(x='Category', y='Values', kind='bar', color='skyblue')
plt.title('Category vs Values')
plt.xlabel('Category')
plt.ylabel('Values')
plt.show()

### 8.2. Histogram
Histograms are useful for showing the distribution of a numerical dataset. You can create a histogram using the `plot()` method with the `kind='hist'` argument.

In [None]:
# Creating a histogram
df_hist = pd.DataFrame({
    'Age': [22, 25, 29, 30, 35, 40, 42, 45, 50, 55, 60]
})
df_hist.plot(kind='hist', bins=5, color='lightgreen')
plt.title('Age Distribution')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()

### 8.3. Scatter Plot
Scatter plots are useful for visualizing the relationship between two numerical variables. You can create a scatter plot using the `plot()` method with the `kind='scatter'` argument.

In [None]:
# Creating a scatter plot
df_scatter = pd.DataFrame({
    'Height': [150, 160, 165, 170, 175, 180, 185],
    'Weight': [50, 55, 60, 65, 70, 75, 80]
})
df_scatter.plot(x='Height', y='Weight', kind='scatter', color='red')
plt.title('Height vs Weight')
plt.xlabel('Height (cm)')
plt.ylabel('Weight (kg)')
plt.show()

## 9. Exporting Data

Pandas provides functionality to export data from DataFrames to various formats, such as CSV, Excel, and JSON. This is useful for saving your cleaned and processed data for later use or for sharing with others.

### 9.1. Exporting to CSV
You can export a DataFrame to a CSV file using the `to_csv()` method.

In [None]:
# Exporting DataFrame to CSV
df.to_csv('exported_data.csv', index=False)
print("DataFrame exported to 'exported_data.csv'")
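
To check what an export contains without writing a file, `to_csv()` returns the CSV text as a string when no path is given. The sketch below reads that text back with `read_csv()` through an in-memory buffer; `csv_text` and `df_roundtrip` are illustrative names.

```python
import io

import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [25, 30, 35, 40, 45],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix']
})

# With no path argument, to_csv() returns the CSV content as a string
csv_text = df.to_csv(index=False)
print(csv_text)

# Read the text back in to confirm that the data survives the round trip
df_roundtrip = pd.read_csv(io.StringIO(csv_text))
print(df_roundtrip)
```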

### 9.2. Exporting to Excel
You can export a DataFrame to an Excel file using the `to_excel()` method.

In [None]:
# Exporting DataFrame to Excel
# (writing .xlsx files requires an Excel engine such as openpyxl to be installed)
df.to_excel('exported_data.xlsx', index=False)
print("DataFrame exported to 'exported_data.xlsx'")

### 9.3. Exporting to JSON
You can export a DataFrame to a JSON file using the `to_json()` method.

In [None]:
# Exporting DataFrame to JSON
df.to_json('exported_data.json', orient='records')
print("DataFrame exported to 'exported_data.json'")
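
With `orient='records'`, each row becomes one JSON object in a list. As with CSV, calling `to_json()` without a path returns the text as a string, which makes the structure easy to inspect. This sketch uses a shortened two-row DataFrame for brevity.

```python
import json

import pandas as pd

df_small = pd.DataFrame({
    'Name': ['Alice', 'Bob'],
    'Age': [25, 30]
})

# With no path argument, to_json() returns the JSON content as a string
json_text = df_small.to_json(orient='records')
print(json_text)

# Parse the text with the standard library to see the list-of-objects layout
parsed = json.loads(json_text)
print(parsed)
```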

## Conclusion

This notebook has covered the basics of Pandas for data manipulation and analysis, including data inspection, selection, cleaning, aggregation, merging, visualization, and exporting. These skills are essential for data science and can be expanded upon with more complex datasets and operations.