# Introduction to Pandas

Objective: In this tutorial, we will explore the Pandas library and its data manipulation and analysis functionality.

Duration: Approximately 1 hour

## **Introduction to Pandas**

**a. Introducing Pandas for Data Manipulation and Analysis:**

Pandas is a powerful and widely-used open-source library in Python for data manipulation and analysis. It provides easy-to-use data structures and functions designed to handle and process structured data efficiently. Pandas is especially popular in the fields of data science, data analysis, and machine learning due to its versatility and performance.

Key features and advantages of Pandas include:

1. **Data Alignment:** Pandas automatically aligns data based on labels, simplifying data manipulation even with missing values.

2. **Flexible Data Handling:** Pandas can handle various data types, including numerical, textual, and categorical data, making it suitable for a wide range of data analysis tasks.

3. **Efficient Operations:** Pandas builds on NumPy arrays and vectorized operations, so column-wise computations on large in-memory datasets typically run much faster than equivalent Python loops.

4. **Data Integration:** It seamlessly integrates with other Python libraries like NumPy, Matplotlib, and scikit-learn, creating a powerful ecosystem for data analysis.

5. **Data Input/Output:** Pandas provides tools to read and write data from various file formats, including CSV, Excel, SQL databases, and more.
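
The data alignment feature listed above can be illustrated with a minimal sketch. The fruit labels and sales figures are made up for illustration; the point is that arithmetic matches values by label rather than by position.

```python
import pandas as pd

# Two Series with partially overlapping labels (illustrative data)
q1_sales = pd.Series([100, 200, 300], index=['apples', 'bananas', 'cherries'])
q2_sales = pd.Series([10, 20, 30], index=['bananas', 'cherries', 'dates'])

# Addition matches values by label, not by position;
# labels present in only one Series produce NaN
total = q1_sales + q2_sales
print(total)

# fill_value treats a missing label as 0 instead of producing NaN
total_filled = q1_sales.add(q2_sales, fill_value=0)
print(total_filled)
```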

**b. Pandas' Two Primary Data Structures: Series and DataFrame:**

Pandas introduces two primary data structures that form the building blocks of data analysis:

**1. Series:**
A Series is a one-dimensional labeled array capable of holding data of any type (for example, integers, strings, or floats). It consists of two main components: the data and the index. The index labels each element in the Series, enabling quick and easy data retrieval.

```python
import pandas as pd

# Creating a Series from a Python list
data = [10, 20, 30, 40, 50]
series = pd.Series(data)

print(series)
```

Output:
```
0    10
1    20
2    30
3    40
4    50
dtype: int64
```
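
The Series above uses the default integer index. As a small sketch of how the index enables retrieval, the example below assigns custom labels; the city names and temperatures are hypothetical.

```python
import pandas as pd

# Hypothetical city temperatures, labeled by city name instead of 0..n-1
temps = pd.Series([21.5, 18.0, 25.3], index=['Lisbon', 'Oslo', 'Cairo'])

print(temps['Oslo'])         # look up a value by its label
print(temps.iloc[0])         # look up a value by its position
print(temps.index.tolist())  # the index component (labels)
print(temps.to_numpy())      # the data component (values)
```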

**2. DataFrame:**
A DataFrame is a two-dimensional labeled data structure, similar to a spreadsheet or SQL table. It consists of rows and columns, where each column can hold data of different types. DataFrames allow for flexible and intuitive data manipulation, filtering, and analysis.

```python
import pandas as pd

# Creating a DataFrame from a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'Salary': [50000, 60000, 75000, 80000]
}

df = pd.DataFrame(data)

print(df)
```

Output:
```
      Name  Age  Salary
0    Alice   25   50000
1      Bob   30   60000
2  Charlie   35   75000
3    David   40   80000
```

DataFrames provide an organized and structured way to work with tabular data, making data analysis tasks more intuitive and efficient.
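
A quick way to see that each column carries its own type is to inspect the DataFrame after creating it. The sketch below rebuilds the same example data and uses a few common inspection attributes and methods.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'Salary': [50000, 60000, 75000, 80000]
})

print(df.shape)       # (number of rows, number of columns)
print(df.dtypes)      # one dtype per column
print(df.head(2))     # first two rows
print(df.describe())  # summary statistics for the numeric columns
```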

Both Series and DataFrame come with a rich set of built-in functions and methods, making Pandas a powerful tool for data manipulation, cleaning, filtering, grouping, and aggregation, which are crucial steps in the data analysis process. With Pandas, users can easily handle real-world datasets, perform exploratory data analysis, and prepare data for modeling and visualization.

## **Working with Pandas DataFrame**

**a. Creating a DataFrame from Different Data Sources:**

Pandas allows you to create a DataFrame from various data sources, such as lists, dictionaries, and CSV files.

**From Lists:**

```python
import pandas as pd

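# Creating a DataFrame from a list of rows (each inner list is one row)
rows = [
    ['Alice', 25, 50000],
    ['Bob', 30, 60000],
    ['Charlie', 35, 75000],
    ['David', 40, 80000]
]

# Column names are supplied separately because lists carry no labels
df = pd.DataFrame(rows, columns=['Name', 'Age', 'Salary'])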

print(df)
```

**From Dictionaries:**

```python
import pandas as pd

# Creating a DataFrame from a dictionary of lists (each key becomes a column)
data = {
    'A': [1, 2, 3],
    'B': [4, 5, 6],
    'C': [7, 8, 9]
}

df = pd.DataFrame(data)

print(df)
```

**From CSV Files:**

```python
import pandas as pd

# Reading a CSV file into a DataFrame (assumes data.csv exists in the working directory)
df = pd.read_csv('data.csv')

print(df)
```
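
If no `data.csv` file is at hand, the same call can be tried on in-memory text. This is a sketch with made-up rows; `io.StringIO` stands in for a file on disk.

```python
import io
import pandas as pd

# In-memory text standing in for the contents of a CSV file (made-up data)
csv_text = """Name,Age,Salary
Alice,25,50000
Bob,30,60000
"""

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))
print(df)

# to_csv without a path returns the CSV as a string;
# index=False leaves the row labels out of the output
round_trip = df.to_csv(index=False)
print(round_trip)
```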

**b. Basic DataFrame Operations:**

**Selecting Columns:**

```python
import pandas as pd

# Creating a DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'Salary': [50000, 60000, 75000, 80000]
}

df = pd.DataFrame(data)

# Selecting a single column
print(df['Name'])

# Selecting multiple columns
print(df[['Name', 'Age']])
```

**Filtering Data:**

```python
import pandas as pd

# Creating a DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'Salary': [50000, 60000, 75000, 80000]
}

df = pd.DataFrame(data)

# Filtering data based on a condition
filtered_data = df[df['Age'] > 30]

print(filtered_data)
```
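
Filters often need more than one condition. The sketch below, using the same example data, combines conditions with `&`, `|`, and `~`; note that each condition needs its own parentheses.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'Salary': [50000, 60000, 75000, 80000]
})

# Both conditions must hold (logical AND)
over_30_high_salary = df[(df['Age'] > 30) & (df['Salary'] >= 80000)]
print(over_30_high_salary)

# Either condition may hold (logical OR)
young_or_top_paid = df[(df['Age'] < 30) | (df['Salary'] >= 80000)]
print(young_or_top_paid)

# isin() keeps rows whose value appears in a list
selected = df[df['Name'].isin(['Alice', 'Charlie'])]
print(selected)

# ~ negates a condition
not_bob = df[~(df['Name'] == 'Bob')]
print(not_bob)
```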

**Handling Missing Values:**

```python
import pandas as pd

# Creating a DataFrame with missing values
data = {
    'A': [1, 2, None, 4],
    'B': [5, None, 7, 8],
    'C': [9, 10, 11, None]
}

df = pd.DataFrame(data)

# Checking for missing values
print(df.isnull())

# Dropping rows with missing values
df_cleaned = df.dropna()

print(df_cleaned)
```

**c. Indexing and Selection of Data in DataFrames:**

Pandas assigns a default integer index (0, 1, 2, ...) to each row when creating a DataFrame. You can instead supply your own row labels through the `index` parameter, or promote an existing column to the index with `set_index()`, which makes label-based selection easier.

```python
import pandas as pd

# Creating a DataFrame with a specific index
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'Salary': [50000, 60000, 75000, 80000]
}

df = pd.DataFrame(data, index=['ID1', 'ID2', 'ID3', 'ID4'])

print(df)

# Accessing data using the index label
print(df.loc['ID2'])

# Accessing data using the integer position (0-based)
print(df.iloc[1])

# Slicing rows by label (with loc, both endpoints are included)
print(df.loc['ID2':'ID3'])
```

Indexing in Pandas provides powerful ways to access and manipulate data. The `loc` attribute is used for label-based indexing, while the `iloc` attribute is used for position-based indexing with 0-based integers. Slices behave differently between the two: `loc` slices include the end label, whereas `iloc` slices exclude the end position, as in ordinary Python slicing.
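
The sketch below contrasts label-based and position-based slicing on the same rows. The `EmployeeID` column is a hypothetical name introduced here to show how an existing column can become the index.

```python
import pandas as pd

df = pd.DataFrame({
    'EmployeeID': ['ID1', 'ID2', 'ID3', 'ID4'],  # hypothetical column
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40]
})

# Promote an existing column to the index
df = df.set_index('EmployeeID')

# Label slices include both endpoints: ID2 and ID3
by_label = df.loc['ID2':'ID3']
print(by_label)

# Position slices exclude the end position: rows 1 and 2
by_position = df.iloc[1:3]
print(by_position)

# Select a single cell by row label and column name
print(df.loc['ID3', 'Age'])
```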

## **Data Wrangling with Pandas**

**a. Data Cleaning Techniques, Handling Missing Data, and Data Transformation:**

Data cleaning is an essential step in the data analysis process. It involves identifying and correcting errors, inconsistencies, and missing values in the dataset to ensure accurate and reliable analysis. Pandas provides several methods to handle data cleaning tasks efficiently.

**Handling Missing Data:**

Pandas offers methods to deal with missing data, such as `dropna()` to remove rows with missing values and `fillna()` to fill missing values with specific values.

```python
import pandas as pd

# Creating a DataFrame with missing values
data = {
    'A': [1, 2, None, 4],
    'B': [5, None, 7, 8],
    'C': [9, 10, 11, None]
}

df = pd.DataFrame(data)

# Dropping rows with any missing values
df_cleaned = df.dropna()

# Filling missing values with a specific value (e.g., 0)
df_filled = df.fillna(0)

print("Original DataFrame:")
print(df)
print("DataFrame after Dropping Rows with Missing Values:")
print(df_cleaned)
print("DataFrame after Filling Missing Values:")
print(df_filled)
```
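
Both methods accept options for finer control. As a minimal sketch on the same data, the example below counts missing values per column, drops rows based on a single column, and fills columns with different values; the fill values `0` and `-1` are arbitrary choices for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, None, 4],
    'B': [5, None, 7, 8],
    'C': [9, 10, 11, None]
})

# Count missing values per column
missing_counts = df.isnull().sum()
print(missing_counts)

# Drop a row only when column 'A' is missing
df_subset = df.dropna(subset=['A'])
print(df_subset)

# Fill each listed column with its own value; column 'C' is left untouched
df_filled = df.fillna({'A': 0, 'B': -1})
print(df_filled)
```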

**Data Transformation:**

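Pandas provides various functions for data transformation, such as `map()`, `apply()`, and `replace()`. `map()` substitutes each value in a Series using a dictionary or function, `apply()` runs a function over the elements of a Series (or over the rows or columns of a DataFrame), and `replace()` swaps specific values for others.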

```python
import pandas as pd

# Creating a DataFrame
data = {
    'Gender': ['Male', 'Female', 'Male', 'Male', 'Female'],
    'Age': [25, 30, 35, 40, 28]
}

df = pd.DataFrame(data)

# Mapping gender values to numeric values
gender_map = {'Male': 0, 'Female': 1}
df['Gender'] = df['Gender'].map(gender_map)

# Applying a function to calculate age category
def age_category(age):
    if age < 30:
        return 'Young'
    else:
        return 'Adult'

df['Age Category'] = df['Age'].apply(age_category)

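# Replacing values in a column (assign the result back instead of using
# inplace=True on a selected column, which newer Pandas versions warn about)
df['Gender'] = df['Gender'].replace({0: 'M', 1: 'F'})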

print(df)
```
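
One difference between `map()` and `replace()` is worth seeing directly: how each treats values that are missing from the dictionary. The sketch below adds a hypothetical `'Other'` value to show this.

```python
import pandas as pd

# 'Other' is a hypothetical value that is absent from both dictionaries
gender = pd.Series(['Male', 'Female', 'Other'])

# map() returns NaN for values missing from the dictionary
mapped = gender.map({'Male': 0, 'Female': 1})
print(mapped)

# replace() leaves values missing from the dictionary unchanged
replaced = gender.replace({'Male': 'M', 'Female': 'F'})
print(replaced)
```

Checking for unexpected NaN values after `map()` is a simple way to catch categories the dictionary did not anticipate.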

**b. Grouping and Aggregation Operations using Pandas:**

Grouping data and performing aggregation operations are common tasks in data analysis. Pandas allows us to group data based on one or more columns and apply aggregation functions like sum, mean, count, etc.

```python
import pandas as pd

# Creating a DataFrame
data = {
    'Category': ['A', 'B', 'A', 'B', 'A'],
    'Value': [10, 20, 15, 25, 30]
}

df = pd.DataFrame(data)

# Grouping data based on the 'Category' column and calculating the mean and sum of 'Value'
grouped_data = df.groupby('Category').agg({'Value': ['mean', 'sum']})

print(grouped_data)
```

In this example, we group the data based on the 'Category' column and calculate the mean and sum of the 'Value' column for each category.
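
The dictionary form above produces two-level column names. As an alternative sketch, named aggregation gives flat, descriptive column names; the names `mean_value`, `total_value`, and `row_count` are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'B', 'A'],
    'Value': [10, 20, 15, 25, 30]
})

# Each keyword is an output column: name=(input column, aggregation function)
summary = df.groupby('Category').agg(
    mean_value=('Value', 'mean'),
    total_value=('Value', 'sum'),
    row_count=('Value', 'count'),
)

print(summary)
```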

Data cleaning and transformation are vital steps in the data analysis process to ensure data quality and consistency. Grouping and aggregation operations are essential for understanding data distributions, patterns, and trends. Pandas provides a wide range of functionalities to efficiently handle these tasks, making it a valuable tool for data scientists and analysts.

## **Practice Exercises**

**Exercise 1: Handling Missing Data**

Given a DataFrame with missing values, use Pandas to:
a) Drop rows with any missing values.
b) Fill missing values in a specific column with the mean of that column.

**Exercise 2: Data Transformation**

Given a DataFrame with a 'Date' column in the format 'YYYY-MM-DD', transform the 'Date' column into separate 'Year', 'Month', and 'Day' columns.

**Exercise 3: Grouping and Aggregation**

Given a DataFrame with information about employees (Name, Department, Salary), use Pandas to group the data by Department and calculate the average salary for each department.

**Exercise 4: Merge and Concatenate DataFrames**

Create two DataFrames with related information (e.g., employee details and department details). Use Pandas to merge or concatenate these DataFrames based on common columns.

**Exercise 5: Pivot Tables**

Given a DataFrame with sales data (Date, Product, Quantity, Price), use Pandas to create a pivot table that shows the total sales (Quantity * Price) for each product on each date.

**Exercise 6: Data Visualization with Pandas**

Create a DataFrame with numerical data and use Pandas to create various plots, such as line plots, bar charts, scatter plots, and histograms.

**Exercise 7: String Manipulation**

Given a DataFrame with a column containing names (e.g., "First Name Last Name"), use Pandas to split the names into separate 'First Name' and 'Last Name' columns.

**Exercise 8: Filter and Select Data**

Given a DataFrame, use Pandas to filter and select rows based on specific conditions and select specific columns.

**Exercise 9: Merge with SQL-Like Operations**

Given two DataFrames with related information, use Pandas to perform merge operations with different join types (inner, outer, left, right).

**Exercise 10: Reshaping Data**

Given a DataFrame with data in a wide format, use Pandas to transform it into a long format using the `melt()` function.

## **Recap and Q&A**