# Introduction to Pandas

🤖 `Notebook by` [Ihsanul Haque](https://www.linkedin.com/in/ihsanul09/)

✅ `Machine Learning Source Codes` [GitHub](https://github.com/ihsanulcode/ML-Batch-2)

📌 `Machine Learning from Scratch` [Course Outline](https://docs.google.com/document/d/15mGNTUSlWQsy4TzcLZUdYedpCMO5KiVq1USaDprHaIc/edit?usp=sharing)

# What is Pandas?

Pandas is a popular open-source Python library used for data manipulation and analysis. It provides powerful data structures and tools for working with structured data, primarily in the form of data frames (tables) that allow you to perform operations like filtering, cleaning, transforming, and analyzing data. Pandas is widely used in data science, machine learning, and other fields.

# Why Use Pandas?

Pandas offers several benefits that make it a preferred choice for data manipulation and analysis:



1. **Data Structures:** Pandas introduces two key data structures, Series (1-dimensional labeled array) and DataFrame (2-dimensional labeled data structure), which are flexible and powerful for handling data.

2. **Ease of Data Handling:** It simplifies common data manipulation tasks like indexing, filtering, reshaping, aggregating, and cleaning data, making it efficient and straightforward.

3. **Integration with Other Libraries:** Pandas integrates well with other Python libraries used in data science, such as NumPy, Matplotlib, and scikit-learn, allowing seamless data transformation and analysis within these environments.

4. **Handling Missing Data:** Pandas provides functionalities to handle missing or incomplete data, making it easier to clean and preprocess datasets without compromising the analysis.

5. **Input/Output Tools:** Pandas supports reading and writing data from various file formats like CSV, Excel, SQL databases, JSON, and more, making it easy to work with different data sources.

6. **Performance:** Pandas is built on top of NumPy and relies on vectorized operations, so it performs well on small and medium-sized datasets that fit in memory, although some convenience features trade away speed. For larger datasets, developers often combine Pandas with other libraries like Dask for distributed computing.
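As a rough illustration of points 4 and 5 above, the sketch below reads a small CSV and deals with one missing value. The CSV text and the column names are made up for this example, and `io.StringIO` stands in for a real file on disk.

```python
import io

import pandas as pd

# Hypothetical CSV content; the empty field in Bob's row is a missing age.
csv_text = "name,age\nAlice,12\nBob,\nIhsanul,13\n"

# read_csv accepts a file path or any file-like object.
people = pd.read_csv(io.StringIO(csv_text))

# Count the missing values in each column.
missing_per_column = people.isna().sum()

# Two common remedies: drop incomplete rows, or fill the gaps.
dropped = people.dropna()
filled = people.fillna({"age": people["age"].mean()})

print(missing_per_column)
print(filled)
```

Whether dropping or filling is appropriate depends on the dataset; filling with the column mean is only one possible choice.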

# What Can Pandas Do?

* Data Loading
* Data Exploration
* Data Cleaning
* Data Manipulation
* Feature Engineering
* Data Preprocessing
* Integration with ML Libraries



## Installation of Pandas
Make sure that Python is already installed.

Install it using command line: `pip install pandas`

Install in notebook: `!pip install pandas`



## Import Pandas
Once Pandas is installed, import it in your applications with the `import` keyword. By convention it is imported under the alias `pd`: `import pandas as pd`

In [1]:
import pandas as pd
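A quick way to confirm that the import worked is to print the installed version. This is only a small sanity check; the version string you see depends on your environment.

```python
import pandas as pd

# pd is just an alias for the pandas module.
# __version__ holds the installed version as a string.
print(pd.__version__)
```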

## Pandas Series

In Pandas, a Series is a one-dimensional labeled array capable of holding data of any type (integer, float, string, Python objects, etc.). It is similar to a Python list or a one-dimensional NumPy array, but every element also carries an index label, which allows label-based access in addition to positional access.

In [2]:
# Creating a Series from a list
data = [10,20,30,40,50]
series = pd.Series(data)
series

0    10
1    20
2    30
3    40
4    50
dtype: int64

In [3]:
type(series)

pandas.core.series.Series

In [4]:
# Custom Indexing in Series
custom_index = ['A', 'B', 'C', 'D', 'E']
series_with_custom_index = pd.Series(data, index=custom_index)
series_with_custom_index

A    10
B    20
C    30
D    40
E    50
dtype: int64

In [5]:
print(series_with_custom_index['B']) # Accessing the item with index 'B'

20
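The index labels matter for more than lookups. The sketch below, which uses made-up `scores` and `bonus` values, shows element-wise arithmetic and how two Series are matched up by label rather than by position.

```python
import pandas as pd

# Hypothetical data: a dictionary's keys become the index labels.
scores = pd.Series({"A": 10, "B": 20, "C": 30})
bonus = pd.Series({"B": 1, "C": 2, "D": 3})

# Arithmetic is applied element by element, with no explicit loop.
doubled = scores * 2

# Operations between two Series align on index labels.
# A label that exists in only one of them produces NaN.
total = scores + bonus

print(doubled)
print(total)
```

Here `total` has values only for `B` and `C`, because those are the labels that both Series share.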


## Pandas DataFrames

Pandas DataFrames are two-dimensional, size-mutable, and potentially heterogeneous tabular data structures with labeled axes (rows and columns). They resemble a spreadsheet or SQL table, and they consist of rows and columns, where each column can hold different types of data.

In [6]:
import pandas as pd

# Creating a DataFrame from a Dictionary
data = {
    'Name' : ['Alice', 'Bob', 'Ihsanul'],
    'Age' : [12,23,13],
    'City' : ["NY", "UK", "Dhaka"]
}

df = pd.DataFrame(data)
df

|  | Name | Age | City |
|---|---|---|---|
| 0 | Alice | 12 | NY |
| 1 | Bob | 23 | UK |
| 2 | Ihsanul | 13 | Dhaka |


In [7]:
# Apply custom index on df
custom_index = ['ID1', "ID2", "ID3"]
df_with_custom_index = pd.DataFrame(data, index=custom_index)
df_with_custom_index

|  | Name | Age | City |
|---|---|---|---|
| ID1 | Alice | 12 | NY |
| ID2 | Bob | 23 | UK |
| ID3 | Ihsanul | 13 | Dhaka |
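A dictionary of columns is not the only possible input. As a sketch, the example below builds a DataFrame from a list of row dictionaries (the `records` name and its values are illustrative) and inspects its basic structure.

```python
import pandas as pd

# Hypothetical records: one dictionary per row.
records = [
    {"Name": "Alice", "Age": 12, "City": "NY"},
    {"Name": "Bob", "Age": 23, "City": "UK"},
]
df_from_records = pd.DataFrame(records)

print(df_from_records.shape)          # (number of rows, number of columns)
print(list(df_from_records.columns))  # column labels
print(df_from_records.dtypes)         # one dtype per column
```

Each column keeps its own dtype, which is what lets a single table mix numbers and text.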


## Data Selection and Indexing


### Column selection

In [8]:
import pandas as pd

# Creating a DataFrame from a Dictionary
data = {
    'Name' : ['Alice', 'Bob', 'Ihsanul'],
    'Age' : [12,23,13],
    'City' : ["NY", "UK", "Dhaka"]
}

df = pd.DataFrame(data)
df

|  | Name | Age | City |
|---|---|---|---|
| 0 | Alice | 12 | NY |
| 1 | Bob | 23 | UK |
| 2 | Ihsanul | 13 | Dhaka |


In [9]:
# Selecting specific columns by name
df[['Name']]

|  | Name |
|---|---|
| 0 | Alice |
| 1 | Bob |
| 2 | Ihsanul |


In [10]:
df[['Name','Age']]

|  | Name | Age |
|---|---|---|
| 0 | Alice | 12 |
| 1 | Bob | 23 |
| 2 | Ihsanul | 13 |
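The cells above pass a list of column names, which is why the result is a table even for a single column. A minimal sketch of the difference between single and double brackets, using a reduced copy of the same data:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Ihsanul"],
    "Age": [12, 23, 13],
})

names_series = df["Name"]    # single brackets: a Series
names_frame = df[["Name"]]   # list of names: a DataFrame with one column

print(type(names_series).__name__)
print(type(names_frame).__name__)
print(names_frame.shape)
```

Which form to use depends on what the next step expects: a one-dimensional Series or a two-dimensional DataFrame.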


### iloc and loc

In [11]:
row_2 = df.iloc[2]
row_2

Name    Ihsanul
Age          13
City      Dhaka
Name: 2, dtype: object

In [12]:
# selecting rows using iloc[]
selected_rows = df.iloc[1:3] # (integer-based indexing)
selected_rows

|  | Name | Age | City |
|---|---|---|---|
| 1 | Bob | 23 | UK |
| 2 | Ihsanul | 13 | Dhaka |


In [13]:
# Selecting rows using loc[] (label-based indexing)
# Apply custom index on df
custom_index = ['ID1', "ID2", "ID3"]
df_with_custom_index = pd.DataFrame(data, index=custom_index)
df_with_custom_index

|  | Name | Age | City |
|---|---|---|---|
| ID1 | Alice | 12 | NY |
| ID2 | Bob | 23 | UK |
| ID3 | Ihsanul | 13 | Dhaka |


In [15]:
df_with_custom_index.loc['ID2':'ID3']

|  | Name | Age | City |
|---|---|---|---|
| ID2 | Bob | 23 | UK |
| ID3 | Ihsanul | 13 | Dhaka |
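One detail in the outputs above is easy to miss: `iloc[1:3]` stops before position 3, while `loc['ID2':'ID3']` includes the label `'ID3'`. The sketch below puts the two side by side on a reduced copy of the same data and also selects a single cell.

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Ihsanul"], "Age": [12, 23, 13]},
    index=["ID1", "ID2", "ID3"],
)

# iloc uses integer positions; the stop position is excluded.
by_position = df.iloc[0:2]

# loc uses index labels; the stop label is included.
by_label = df.loc["ID1":"ID2"]

# Both accept a second argument that selects columns at the same time.
bob_age = df.loc["ID2", "Age"]
first_name = df.iloc[0, 0]

print(by_position)
print(by_label)
print(bob_age, first_name)
```

Both slices return the same two rows here, even though one is written with positions and the other with labels.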


### query
The `.query()` method allows you to filter rows based on a query expression written as a string.

In [16]:
import pandas as pd

# Creating a DataFrame from a Dictionary
data = {
    'Name' : ['Alice', 'Bob', 'Ihsanul'],
    'Age' : [12,23,13],
    'City' : ["NY", "UK", "Dhaka"]
}

df = pd.DataFrame(data)
df

|  | Name | Age | City |
|---|---|---|---|
| 0 | Alice | 12 | NY |
| 1 | Bob | 23 | UK |
| 2 | Ihsanul | 13 | Dhaka |


In [18]:
# Filtering rows using query expression
filtered = df.query('Age>20')
filtered

|  | Name | Age | City |
|---|---|---|---|
| 1 | Bob | 23 | UK |
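Query expressions can combine conditions and refer to Python variables with the `@` prefix. The sketch below uses a hypothetical `min_age` threshold and shows the same filter written as a boolean mask for comparison.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Ihsanul"],
    "Age": [12, 23, 13],
    "City": ["NY", "UK", "Dhaka"],
})

min_age = 13  # hypothetical threshold, chosen for illustration

# Inside the expression, @ refers to a Python variable in the enclosing scope.
with_query = df.query("Age >= @min_age and City != 'UK'")

# The same filter written as a boolean mask.
with_mask = df[(df["Age"] >= min_age) & (df["City"] != "UK")]

print(with_query)
```

The two forms select the same rows; `query()` tends to read more easily when several conditions are combined.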


## Data Exploration and Information

### info()

The `df.info()` method in Pandas prints a concise summary of a DataFrame: the index dtype, each column's dtype and non-null count, and the memory usage. It's a handy way to quickly get an overview of the DataFrame's structure and the data it contains.

In [19]:
import pandas as pd

# Creating a DataFrame from a Dictionary
data = {
    'Name' : ['Alice', 'Bob', 'Ihsanul'],
    'Age' : [12,23,13],
    'City' : ["NY", "UK", "Dhaka"]
}

df = pd.DataFrame(data)
df

|  | Name | Age | City |
|---|---|---|---|
| 0 | Alice | 12 | NY |
| 1 | Bob | 23 | UK |
| 2 | Ihsanul | 13 | Dhaka |


In [20]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    3 non-null      object
 1   Age     3 non-null      int64 
 2   City    3 non-null      object
dtypes: int64(1), object(2)
memory usage: 200.0+ bytes
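Note that `info()` prints its report and returns `None`. If you need the text itself, or the individual facts as objects, something like the following sketch can help; it uses the `buf` parameter of `info()` together with an in-memory buffer.

```python
import io

import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Ihsanul"],
    "Age": [12, 23, 13],
})

# Redirect the report into a string buffer instead of standard output.
buffer = io.StringIO()
result = df.info(buf=buffer)   # result is None
report = buffer.getvalue()

# The same facts are also available as ordinary objects.
print(df.shape)          # (rows, columns)
print(df.dtypes)         # dtype of each column
print(df.isna().sum())   # number of missing values per column
```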


### describe()
The `describe()` method in Pandas generates descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset's distribution. By default it reports on the numerical columns of a DataFrame only.

In [21]:
df.describe()

|  | Age |
|---|---|
| count | 3.0 |
| mean | 16.0 |
| std | 6.082763 |
| min | 12.0 |
| 25% | 12.5 |
| 50% | 13.0 |
| 75% | 18.0 |
| max | 23.0 |
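The output above covers only `Age` because `Name` and `City` hold text. As a sketch, passing `include="all"` asks for every column; text columns then get `unique`, `top` (most frequent value), and `freq` instead of mean and quartiles. The `City` values below are changed slightly so that one value repeats.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Ihsanul"],
    "Age": [12, 23, 13],
    "City": ["NY", "UK", "NY"],   # hypothetical values with one repeat
})

numeric_summary = df.describe()             # numerical columns only
full_summary = df.describe(include="all")   # every column

print(numeric_summary)
print(full_summary)
```

In `full_summary`, statistics that do not apply to a column (for example the mean of `City`) appear as NaN.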


### head() and tail()
The `head()` and `tail()` methods in Pandas return the first or last rows of a DataFrame: five rows by default, or `n` rows if you pass a number. They are helpful for quickly examining the beginning or end of a DataFrame to get a sense of its structure or contents.

In [29]:
import pandas as pd

# Creating a DataFrame from a Dictionary
data = {
    'Name' : ['Alice', 'Bob', 'Ihsanul',"Hasan", "Mahadi", "Sami", "Sadi", "Siam"],
    'Age' : [12,23,13,12,2,13,4,5],
    'City' : ["NY", "UK", "Dhaka","NY", "UK", "Dhaka","NY", "UK"]
}

df = pd.DataFrame(data)
df

|  | Name | Age | City |
|---|---|---|---|
| 0 | Alice | 12 | NY |
| 1 | Bob | 23 | UK |
| 2 | Ihsanul | 13 | Dhaka |
| 3 | Hasan | 12 | NY |
| 4 | Mahadi | 2 | UK |
| 5 | Sami | 13 | Dhaka |
| 6 | Sadi | 4 | NY |
| 7 | Siam | 5 | UK |


In [30]:
df.head()

|  | Name | Age | City |
|---|---|---|---|
| 0 | Alice | 12 | NY |
| 1 | Bob | 23 | UK |
| 2 | Ihsanul | 13 | Dhaka |
| 3 | Hasan | 12 | NY |
| 4 | Mahadi | 2 | UK |


In [31]:
df.head(6)

|  | Name | Age | City |
|---|---|---|---|
| 0 | Alice | 12 | NY |
| 1 | Bob | 23 | UK |
| 2 | Ihsanul | 13 | Dhaka |
| 3 | Hasan | 12 | NY |
| 4 | Mahadi | 2 | UK |
| 5 | Sami | 13 | Dhaka |


In [32]:
df.tail()

|  | Name | Age | City |
|---|---|---|---|
| 3 | Hasan | 12 | NY |
| 4 | Mahadi | 2 | UK |
| 5 | Sami | 13 | Dhaka |
| 6 | Sadi | 4 | NY |
| 7 | Siam | 5 | UK |


In [33]:
df.tail(3)

|  | Name | Age | City |
|---|---|---|---|
| 5 | Sami | 13 | Dhaka |
| 6 | Sadi | 4 | NY |
| 7 | Siam | 5 | UK |
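Two edge cases are worth knowing. A negative `n` means "all rows except that many", and an `n` larger than the DataFrame simply returns every row. A small sketch, using only the `Age` column from the data above:

```python
import pandas as pd

df = pd.DataFrame({"Age": [12, 23, 13, 12, 2, 13, 4, 5]})

first_five = df.head()            # n defaults to 5
all_but_last_two = df.head(-2)    # everything except the last 2 rows
all_but_first_two = df.tail(-2)   # everything except the first 2 rows
whole_frame = df.head(100)        # n larger than the frame: every row

print(len(first_five), len(all_but_last_two), len(whole_frame))
```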


### value_counts()
The `value_counts()` method in Pandas counts the occurrences of each unique value in a column of a DataFrame and lists the most frequent value first. It's particularly useful for understanding the distribution of values within a specific column.

In [34]:
df['City'].value_counts()

NY       3
UK       3
Dhaka    2
Name: City, dtype: int64

In [35]:
df['Age'].value_counts()

12    2
13    2
23    1
2     1
4     1
5     1
Name: Age, dtype: int64
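Counts can also be expressed as proportions or ordered by value instead of by frequency. The sketch below rebuilds the `City` column from the example above as a standalone Series named `cities`.

```python
import pandas as pd

cities = pd.Series(["NY", "UK", "Dhaka", "NY", "UK", "Dhaka", "NY", "UK"])

counts = cities.value_counts()                 # sorted by count, descending
shares = cities.value_counts(normalize=True)   # relative frequencies
by_label = cities.value_counts().sort_index()  # sorted by value instead

print(counts)
print(shares)
print(by_label)
```

With `normalize=True` the results sum to 1, which makes columns of different lengths easier to compare.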