# Introduction to Pandas and Data Structures

## Engr Muhammad Saqlain

## What is Pandas?

[Pandas](https://pandas.pydata.org/) is a widely used open-source data manipulation and analysis library for Python. It provides data structures for storing labeled and tabular data efficiently, along with tools for cleaning, reshaping, and analyzing that data. Developed by Wes McKinney, Pandas is built on top of the NumPy library and is a fundamental tool in the data science and machine learning ecosystem.

## Why is Pandas Important in Data Analysis?

1. **Ease of Data Handling:** Pandas simplifies the process of importing, cleaning, and manipulating structured data. It provides easy-to-use data structures, primarily the DataFrame, that are highly efficient and intuitive.

2. **Flexibility in Data Structures:** With Pandas, you can work with Series (1-dimensional labeled arrays) and DataFrames (2-dimensional labeled data structures) to handle a variety of data types, including numerical, categorical, and textual data.

3. **Data Cleaning and Preprocessing:** Pandas offers functions for handling missing data (such as `isna`, `fillna`, and `dropna`), converting data types (`astype`), and performing other operations needed to prepare data for analysis.

4. **Powerful Data Analysis Tools:** Pandas enables powerful data analysis through its groupby operations, pivot tables, and statistical functions. It facilitates exploratory data analysis (EDA) and the extraction of meaningful insights from datasets.

5. **Integration with Other Libraries:** Pandas seamlessly integrates with other popular Python libraries, such as NumPy, Matplotlib, and scikit-learn, providing a comprehensive ecosystem for data analysis and machine learning.

6. **Time Series Analysis:** Pandas excels in handling time series data, making it a valuable tool for tasks like financial analysis, stock market predictions, and trend analysis.

In this session, we will dive into the basics of Pandas, exploring its data structures, learning how to import and export data, and performing fundamental operations for data manipulation.


# Installation and Setup

## Installing Pandas and Setting Up the Environment

[Pandas](https://pandas.pydata.org/) can be installed with pip, Python's package manager. Run the following command in a Jupyter Notebook cell to install it:

!pip install pandas


# Data Structures in Pandas

## Series

A `Series` in Pandas is a one-dimensional labeled array capable of holding any data type. It is essentially a labeled column of data and can be created from various data sources. Let's explore how to create a Pandas Series:

### Creating a Series from a List


In [None]:
import pandas as pd

# Creating a Series from a list of numbers
numbers = [1, 3, 5, 7, 9]
series_from_list = pd.Series(numbers)
series_from_list

###  Creating a Series with Custom Index

In [None]:
# Creating a Series from a dictionary; the keys become the custom index labels
data = {'a': 10, 'b': 20, 'c': 30, 'd': 40}
series_custom_index = pd.Series(data)
series_custom_index

###  Creating a Series from a NumPy Array

In [None]:
import numpy as np

# Creating a Series from a NumPy array
numpy_array = np.array([2, 4, 6, 8, 10])
series_from_numpy = pd.Series(numpy_array)
series_from_numpy

### Creating a Series with Explicit Index

In [None]:
# Creating a Series with explicit index
data_explicit_index = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
data_explicit_index
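
A Series pairs an array of values with an index of labels. As a minimal sketch (the variable names are illustrative and not part of the session), the two parts and the shared data type can be inspected separately:

```python
import pandas as pd

# Illustrative Series with an explicit string index
fractions = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

index_labels = fractions.index.tolist()  # the labels attached to each value
values = fractions.to_numpy()            # the underlying NumPy array
dtype_name = str(fractions.dtype)        # data type shared by all values

print(index_labels, values, dtype_name)
```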

### Series Indexing

Pandas Series support various methods of indexing, allowing you to access and manipulate data efficiently.

####  Standard Indexing

In [None]:
# Creating a Series for indexing examples
sample_series = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

# Accessing a single element by position
# (.iloc is used because plain [0] on a string-labeled index is deprecated in recent pandas versions)
sample_series.iloc[0]

In [None]:
sample_series[1:4]
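
Position and label can disagree when the index itself holds integers. The sketch below uses a hypothetical Series (not from the session) to show how `.iloc` and `.loc` make the intent explicit:

```python
import pandas as pd

# Hypothetical Series whose integer labels do not match the positions
scores = pd.Series([10, 20, 30], index=[2, 1, 0])

first_by_position = scores.iloc[0]  # element at position 0
value_at_label_0 = scores.loc[0]    # element whose label is 0

print(first_by_position, value_at_label_0)
```

Using `.iloc` for positions and `.loc` for labels avoids having to remember which interpretation plain square brackets will choose.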

### Label-based Indexing

In [None]:
# Accessing elements using custom labels
sample_series['b']


In [None]:
sample_series[['c', 'd']]

### Slicing

In [None]:
# Slicing the Series
sample_series[1:4]
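
Slices behave differently depending on whether they use positions or labels: a positional slice excludes its end point, while a label slice includes it. A small sketch, assuming a Series shaped like `sample_series` (the name `letters` is illustrative):

```python
import pandas as pd

letters = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

by_position = letters.iloc[1:3]   # positions 1 and 2; position 3 is excluded
by_label = letters.loc['b':'c']   # labels 'b' through 'c'; 'c' is included
through_d = letters.loc['b':'d']  # the end label 'd' is included as well

print(by_position.tolist(), by_label.tolist(), through_d.tolist())
```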


### Conditional or Boolean Indexing

In [None]:
# Conditional indexing (selecting elements greater than 20)
sample_series[sample_series > 20]
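
Conditions can be combined with `&` (and), `|` (or), and `~` (not). Each condition needs its own parentheses because these operators bind more tightly than comparisons. A short sketch with an illustrative Series named `readings`:

```python
import pandas as pd

readings = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

in_range = readings[(readings > 10) & (readings < 40)]    # both conditions hold
extremes = readings[(readings <= 10) | (readings >= 40)]  # either condition holds
not_large = readings[~(readings > 20)]                    # condition negated

print(in_range.tolist(), extremes.tolist(), not_large.tolist())
```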


### Fancy Indexing

In [None]:
# Fancy indexing using a list of index labels
indices = ['a', 'c', 'd']
sample_series[indices]

# Basic Operations on Pandas Series

Pandas Series support a variety of operations, making it easy to perform computations and manipulations on your data.

### Arithmetic Operations

In [None]:
# Creating two Series for arithmetic operations
series1 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
series2 = pd.Series([5, 6, 7, 8], index=['a', 'b', 'c', 'd'])

In [None]:

# Addition
series1 + series2


In [None]:
# Subtraction
series1 - series2


In [None]:
# Multiplication
series1 * series2

In [None]:
# Division
series1 / series2
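
The examples above use two Series with identical indexes. When the indexes differ, Pandas aligns values by label before computing, and labels present on only one side produce `NaN`. A sketch with hypothetical Series `left` and `right`:

```python
import pandas as pd

left = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
right = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# Labels 'a' and 'd' exist on only one side, so the result is NaN there
total = left + right

# The add method with fill_value treats a missing side as 0 instead
filled = left.add(right, fill_value=0)

print(total)
print(filled)
```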


### Element-wise Functions

In [None]:
# Square root of each element
np.sqrt(series1)

In [None]:
# Exponential function
np.exp(series1)

In [None]:
# Logarithm (natural logarithm)
np.log(series2)

### Aggregation Functions

In [None]:
series1.sum()  # Sum of all elements

In [None]:
series2.mean()  # Mean of all elements

In [None]:
series1.max()  # Maximum value

In [None]:
series2.min()  # Minimum value
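
Several aggregations can also be computed in one call with `agg`, which returns a new Series labeled by function name. A minimal sketch using an illustrative Series named `counts`:

```python
import pandas as pd

counts = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])

# Passing a list of function names returns one result per name
summary = counts.agg(['sum', 'mean', 'min', 'max'])

print(summary)
```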

### Element-wise Comparison

In [None]:
# Element-wise comparison between two Series
comparison_result = series1 > series2

comparison_result

### Updating Values

In [None]:
# Updating values in a Series
series1['a'] = 10  # Updating a specific element

series1
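
Several elements can be updated at once by assigning through `.loc`, either with a list of labels or with a boolean condition. A sketch with a hypothetical Series named `stock`:

```python
import pandas as pd

stock = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])

stock.loc[['b', 'c']] = 0    # update two labels at once
stock.loc[stock > 3] = 100   # update every element that meets a condition

print(stock)
```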


# DataFrame Creation in Pandas

Pandas DataFrames are two-dimensional labeled data structures that can store and manipulate data in tabular form. Let's explore various methods of creating DataFrames.

### Creating a DataFrame from a Dictionary

In [None]:
# Creating a DataFrame from a dictionary
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 35, 28],
        'City': ['New York', 'San Francisco', 'Los Angeles', 'Chicago']}

df_from_dict = pd.DataFrame(data)

df_from_dict
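
After creating a DataFrame, it is often useful to check its size, column names, and column types. A minimal sketch, assuming the same data as above under the illustrative name `people`:

```python
import pandas as pd

people = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'David'],
                       'Age': [25, 30, 35, 28],
                       'City': ['New York', 'San Francisco', 'Los Angeles', 'Chicago']})

n_rows, n_cols = people.shape             # (rows, columns)
column_names = people.columns.tolist()    # column labels in order
age_is_integer = pd.api.types.is_integer_dtype(people['Age'])

print(n_rows, n_cols, column_names)
print(people.dtypes)                      # one dtype per column
```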


### Creating a DataFrame from a List of Lists

In [None]:
# Creating a DataFrame from a list of lists
data_list = [['Alice', 25, 'New York'],
             ['Bob', 30, 'San Francisco'],
             ['Charlie', 35, 'Los Angeles'],
             ['David', 28, 'Chicago']]

df_from_list = pd.DataFrame(data_list, columns=['Name', 'Age', 'City'])

df_from_list

### Creating a DataFrame from a NumPy Array

In [None]:
# Creating a DataFrame from a NumPy array
data_array = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

df_from_array = pd.DataFrame(data_array, columns=['A', 'B', 'C'])

df_from_array


### Creating an Empty DataFrame with Columns

In [None]:
# Creating an empty DataFrame with columns
columns = ['Name', 'Age', 'City']
empty_df = pd.DataFrame(columns=columns)

empty_df
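
An empty DataFrame is sometimes used as a starting point for adding rows one at a time. One common alternative, sketched here with hypothetical names, is to collect the rows in a plain Python list and build the DataFrame once at the end:

```python
import pandas as pd

# Collect each record as a dictionary (the records here are illustrative)
rows = []
for name, age, city in [('Alice', 25, 'New York'), ('Bob', 30, 'San Francisco')]:
    rows.append({'Name': name, 'Age': age, 'City': city})

# Build the DataFrame in a single step, fixing the column order explicitly
collected_df = pd.DataFrame(rows, columns=['Name', 'Age', 'City'])

print(collected_df)
```

Building the DataFrame once avoids repeatedly copying data, which is what happens when rows are appended to an existing DataFrame in a loop.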


# DataFrame Indexing in Pandas

Indexing in Pandas DataFrames involves selecting and manipulating data based on rows and columns. Let's explore different indexing methods.

### Indexing Columns


In [None]:
# Creating a DataFrame for indexing examples
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 35, 28],
        'City': ['New York', 'San Francisco', 'Los Angeles', 'Chicago']}

df = pd.DataFrame(data)


In [None]:
df

In [None]:

# Indexing a single column
df['Name']


In [None]:
# Indexing multiple columns
df[['Name', 'Age']]

### Indexing Rows

In [None]:
# Indexing rows using iloc (integer-location based indexing)
first_row = df.iloc[0]  # Selecting the first row; df.iloc[0, 0] selects a single cell instead
first_row

In [None]:
df

In [None]:
# Indexing rows using loc (label-based indexing)
second_row = df.loc[1]  # Selecting the second row (label 1 in the default index)
second_row
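
With the default index, labels and positions happen to coincide, which hides the difference between `loc` and `iloc`. The sketch below uses a hypothetical DataFrame with labels 10, 20, 30 to separate the two:

```python
import pandas as pd

people = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                       'Age': [25, 30, 35]},
                      index=[10, 20, 30])

by_label = people.loc[20]      # the row whose index label is 20
by_position = people.iloc[1]   # the row at position 1 (the second row)
cell = people.loc[20, 'Age']   # a single cell by row label and column name

print(by_label['Name'], by_position['Name'], cell)
```

Here both lookups return Bob's row, but `people.loc[1]` would raise a `KeyError` because no row carries the label 1.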

### Conditional or boolean indexing

In [None]:
# Conditional indexing on DataFrame
filtered_data = df[df['Age'] > 30]

filtered_data
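
Row filters can combine several conditions, and `isin` checks membership in a list of values. A sketch under the assumption that the data matches the example above (the name `people` is illustrative):

```python
import pandas as pd

people = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'David'],
                       'Age': [25, 30, 35, 28],
                       'City': ['New York', 'San Francisco', 'Los Angeles', 'Chicago']})

# Rows where Age is above 26 and City is one of the two listed values
west_coast = people[(people['Age'] > 26) &
                    (people['City'].isin(['San Francisco', 'Los Angeles']))]

print(west_coast)
```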


### Indexing with at and iat

In [None]:
# Indexing with `at` (label-based)
df.at[1, 'Name']

In [None]:
# Indexing with `iat` (integer-location based)
df.iat[1, 0]

### Setting and Resetting Index

In [None]:
df

In [None]:
# Setting a new index (returns a new DataFrame; df itself is unchanged)
df_with_index = df.set_index('Name')
df_with_index

In [None]:
df

In [None]:
# Resetting index
df_reset_index = df_with_index.reset_index()
df_reset_index
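
Once a column becomes the index, its values can be used as row labels with `loc`. A minimal sketch with illustrative names:

```python
import pandas as pd

people = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'David'],
                       'Age': [25, 30, 35, 28],
                       'City': ['New York', 'San Francisco', 'Los Angeles', 'Chicago']})

by_name = people.set_index('Name')     # 'Name' moves from the columns to the index
bob_age = by_name.loc['Bob', 'Age']    # look up a row by name
restored = by_name.reset_index()       # 'Name' becomes a regular column again

print(bob_age)
print(restored.columns.tolist())
```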

# Basic Operations on Pandas DataFrames

Pandas DataFrames support a wide range of operations for data manipulation and analysis. Let's explore some fundamental operations.

### Accessing Columns


In [None]:
# Creating a DataFrame for basic operations examples
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 35, 28],
        'City': ['New York', 'San Francisco', 'Los Angeles', 'Chicago']}

df = pd.DataFrame(data)

df

### Adding a New Column

In [None]:
# Adding a new column
df['Salary'] = [50000, 60000, 75000, 48000]

df
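
New columns can also be derived from existing ones. The sketch below assumes the same salary figures; the column names `MonthlySalary` and `Bonus` and the 10% rate are hypothetical:

```python
import pandas as pd

staff = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'David'],
                      'Salary': [50000, 60000, 75000, 48000]})

# Direct assignment adds the column to the existing DataFrame
staff['MonthlySalary'] = staff['Salary'] / 12

# assign returns a new DataFrame and leaves the original untouched
with_bonus = staff.assign(Bonus=staff['Salary'] * 0.10)

print(with_bonus)
```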


### Descriptive Statistics

In [None]:
# Descriptive statistics on a DataFrame
summary_statistics = df.describe()

summary_statistics


### Transposing a DataFrame

In [None]:
# Transposing a DataFrame (switching rows and columns)
transposed_df = df.T

transposed_df


###  Sorting Data

In [None]:
# Sorting a DataFrame by a specific column
sorted_df = df.sort_values(by='Age', ascending=False)

sorted_df
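
Sorting can use more than one column, with a separate direction for each. The sketch below uses hypothetical data in which two people share the same age, so the second sort key decides their order:

```python
import pandas as pd

people = pd.DataFrame({'Name': ['Alice', 'Charlie', 'Bob', 'David'],
                       'Age': [25, 30, 30, 28]})

# Oldest first; ties on Age are broken alphabetically by Name
ranked = people.sort_values(by=['Age', 'Name'], ascending=[False, True])

# sort_values keeps the original index labels; reset them for a clean 0..n-1 index
ranked = ranked.reset_index(drop=True)

print(ranked)
```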


### Filtering Data

In [None]:
# Filtering data based on a condition
filtered_data = df[df['Age'] > 30]

filtered_data


### Grouping and Aggregation

In [None]:
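# Grouping by a column and calculating the mean of the numeric columns
# (numeric_only=True skips text columns such as Name, which otherwise raise a TypeError in pandas 2.x)
grouped_data = df.groupby('City').mean(numeric_only=True)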

grouped_data
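
In the data above every city appears only once, so each group holds a single row. The sketch below uses hypothetical data with repeated cities to show grouping that actually combines rows:

```python
import pandas as pd

staff = pd.DataFrame({'City': ['New York', 'Chicago', 'New York', 'Chicago'],
                      'Age': [25, 30, 35, 28],
                      'Salary': [50000, 60000, 75000, 48000]})

# Mean of the selected numeric columns within each city
mean_by_city = staff.groupby('City')[['Age', 'Salary']].mean()

# Several aggregations of one column at once
salary_summary = staff.groupby('City')['Salary'].agg(['mean', 'max'])

print(mean_by_city)
print(salary_summary)
```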


In [None]:
df

### Dropping columns

In [None]:
# Dropping a column (returns a new DataFrame; df itself is unchanged)
df_without_age = df.drop('Age', axis=1)

df_without_age
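
`drop` can also remove several columns at once or remove rows by index label. A minimal sketch with illustrative names, using the `columns` and `index` keywords in place of `axis`:

```python
import pandas as pd

people = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'David'],
                       'Age': [25, 30, 35, 28],
                       'City': ['New York', 'San Francisco', 'Los Angeles', 'Chicago']})

names_only = people.drop(columns=['Age', 'City'])  # drop two columns
later_rows = people.drop(index=[0, 1])             # drop the rows labeled 0 and 1

print(names_only)
print(later_rows)
```

Both calls return new DataFrames, so `people` still has all four rows and three columns afterwards.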


In [None]:
df