# Pandas Basics Tutorial

# 1. What is Pandas?

Pandas is a fast, powerful, flexible, and easy-to-use open-source data analysis and manipulation library built on top of the Python programming language.

In [2]:
import pandas as pd

# 2. Why and When is Pandas Used?

Pandas is used for structured data operations and manipulations. It is extensively used for data preparation, cleaning, and analysis.
It is efficient for handling and analyzing data that fits in memory, and it enables interactive analysis.

# 3. Differences From and Benefits Over NumPy

Pandas provides high-level data structures and functions designed to make data analysis fast and easy. While NumPy is focused on numerical computations on homogeneous arrays, Pandas is designed for working with tabular or heterogeneous data. Pandas DataFrames allow for more flexible data manipulation, such as labeled rows and columns, and come with a large number of methods for data analysis.

Note that a Pandas Series (for example, a single DataFrame column) is often backed by a NumPy array.

Since Pandas 2.0 (April 2023), columns can also be stored in Arrow format, which has some advantages for performance and for integration with other libraries such as Polars.

In [3]:
import numpy as np
pd.Series([1, 3, 5, np.nan, 6, 8])

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
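As a quick check of the note above about NumPy-backed Series, the following sketch converts the same Series with `to_numpy()` and inspects the result. It assumes the default NumPy-backed storage rather than Arrow, and the variable name `arr` is only illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 6, 8])

# to_numpy() exposes the Series values as a NumPy array
# (assumption: default NumPy-backed storage, not Arrow)
arr = s.to_numpy()
print(type(arr))   # <class 'numpy.ndarray'>
print(arr.dtype)   # float64, because NaN forces a floating-point dtype

# NumPy functions work on the extracted array; nansum skips the NaN
total = np.nansum(arr)
print(total)       # 23.0
```

The integers in the input list become floats because the NumPy `float64` dtype is what represents the missing value `NaN` here.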

# 4. Pandas Series vs Pandas DataFrame

A Pandas Series is a one-dimensional array-like object that can hold many data types, whereas a DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns).

A DataFrame is similar to a table in a database or an Excel sheet, and each of its columns is a Series.

In [10]:
# Creating a series
s = pd.Series([1, 3, 5, np.nan, 6, 8])
# Creating a DataFrame
data = {'A': [1, 2], 'B': [3, 4], 'Name': ['Andrew', 'Barbara']}
df = pd.DataFrame(data)

print(df.dtypes)

df

A        int64
B        int64
Name    object
dtype: object


   A  B     Name
0  1  3   Andrew
1  2  4  Barbara
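The relationship between the two structures described in this section can be checked with a short sketch: selecting one column of a DataFrame yields a Series that shares the DataFrame's index. The names `col` and `rebuilt` are illustrative and not part of the original notebook.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'Name': ['Andrew', 'Barbara']})

# Selecting a single column returns a Series
col = df['A']
print(type(col).__name__)          # Series
print(col.name)                    # A (the column label becomes the Series name)
print(col.index.equals(df.index))  # True: the row labels are shared

# Going the other way: a DataFrame can be assembled from Series
rebuilt = pd.DataFrame({'A': df['A'], 'B': df['B']})
print(rebuilt.shape)               # (2, 2)
```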


# 5. Viewing Data Through .head() and .tail()

The `.head()` method returns the first rows of the DataFrame, while `.tail()` returns the last rows. Both return five rows by default and accept the number of rows as an argument.

In [11]:
df.head()

   A  B     Name
0  1  3   Andrew
1  2  4  Barbara
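Because `df` has only two rows, `.head()` returns the whole table. The sketch below uses a hypothetical ten-row DataFrame named `numbers` (not part of the original notebook) to show the default of five rows and the optional row-count argument.

```python
import pandas as pd

# Hypothetical ten-row DataFrame, used only to make head/tail visible
numbers = pd.DataFrame({'n': range(10), 'square': [i * i for i in range(10)]})

first_five = numbers.head()    # default: first 5 rows
first_three = numbers.head(3)  # first 3 rows
last_two = numbers.tail(2)     # last 2 rows

print(len(first_five))             # 5
print(first_three['n'].tolist())   # [0, 1, 2]
print(last_two['n'].tolist())      # [8, 9]
```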


# 6. What Does the dtypes Attribute Show?

The `dtypes` attribute shows the data type of each column in the DataFrame. It is an attribute, not a method, so it is used without parentheses.

In [12]:
df.dtypes

A        int64
B        int64
Name    object
dtype: object
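The result of `dtypes` is itself a Series indexed by column name, so individual entries can be looked up by label. The following sketch shows that and, as one possible use, picks out the numeric columns with `pd.api.types.is_numeric_dtype`; the name `numeric_cols` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'Name': ['Andrew', 'Barbara']})

dtypes = df.dtypes
print(type(dtypes).__name__)  # Series
print(dtypes['A'])            # int64

# Keep only the columns whose dtype is numeric
numeric_cols = [c for c in df.columns if pd.api.types.is_numeric_dtype(df[c])]
print(numeric_cols)           # ['A', 'B']
```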

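# 7. What Does the describe() Method Do?

The `describe()` method returns summary statistics (count, mean, standard deviation, minimum, quartiles, and maximum) for the DataFrame columns. By default it covers only the numeric columns, which is why `Name` does not appear in the output.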

In [13]:
df.describe()

              A         B
count  2.000000  2.000000
mean   1.500000  3.500000
std    0.707107  0.707107
min    1.000000  3.000000
25%    1.250000  3.250000
50%    1.500000  3.500000
75%    1.750000  3.750000
max    2.000000  4.000000
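The summary above leaves out the `Name` column. As a sketch of how to include non-numeric columns, `describe()` accepts an `include` argument; with `include='all'` the text column gets its own statistics such as `count` and `unique`. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'Name': ['Andrew', 'Barbara']})

# Default: numeric columns only
numeric_summary = df.describe()
print(numeric_summary.columns.tolist())   # ['A', 'B']
print(numeric_summary.loc['mean', 'A'])   # 1.5

# include='all' adds non-numeric columns; statistics that do not
# apply to a column are filled with NaN
full_summary = df.describe(include='all')
print(full_summary.columns.tolist())      # ['A', 'B', 'Name']
print(full_summary.loc['unique', 'Name']) # 2
```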


# 8. Understanding .columns, .index, and .values Attributes

The `.columns` attribute holds the column labels, `.index` holds the row labels, and `.values` returns the data of the DataFrame as a NumPy array. The pandas documentation recommends the `.to_numpy()` method over `.values` for this conversion.

In [15]:
print(df.columns)
print(df.index)
print(df.values)

Index(['A', 'B', 'Name'], dtype='object')
RangeIndex(start=0, stop=2, step=1)
[[1 3 'Andrew']
 [2 4 'Barbara']]
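The array printed above mixes integers and strings, so NumPy has to fall back to the generic `object` dtype. The sketch below, using `to_numpy()`, compares that with converting only the numeric columns; the names `arr_all` and `arr_num` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'Name': ['Andrew', 'Barbara']})

# Mixed column types: the array needs one common dtype, which is object
arr_all = df.to_numpy()
print(arr_all.dtype)   # object

# Numeric columns only: the array keeps an integer dtype
arr_num = df[['A', 'B']].to_numpy()
print(arr_num.dtype)   # int64
print(arr_num.shape)   # (2, 2)
```

Selecting the numeric columns before converting keeps the result usable for fast numerical work in NumPy.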


# 9. Different Ways to Select a Column

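You can select a column from a DataFrame in multiple ways, including `df['column_name']`, `df.column_name`, and `df.loc[:, 'column_name']`. Attribute access (`df.column_name`) works only when the column name is a valid Python identifier that does not clash with an existing DataFrame attribute or method. The `.loc` indexer is recommended for more complex selections, such as choosing rows and columns at the same time.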

In [16]:
print(df['A'])
print(df.A)
print(df.loc[:, 'A'])


0    1
1    2
Name: A, dtype: int64
0    1
1    2
Name: A, dtype: int64
0    1
1    2
Name: A, dtype: int64
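The three forms above give the same result for column `A`, but they are not always interchangeable. The sketch below uses a hypothetical DataFrame with a column named `count` (which clashes with the `DataFrame.count` method) and one named `first name` (which contains a space) to show where attribute access stops working.

```python
import pandas as pd

# Hypothetical column names chosen to illustrate the limits of attribute access
df = pd.DataFrame({'A': [1, 2],
                   'count': [10, 20],
                   'first name': ['Andrew', 'Barbara']})

single = df['A']     # one label  -> Series
frame = df[['A']]    # list of labels -> DataFrame with one column
print(type(single).__name__, type(frame).__name__)  # Series DataFrame

# df.count is the DataFrame method, not the column
print(callable(df.count))      # True
print(df['count'].tolist())    # [10, 20]

# A name with a space cannot be written as an attribute at all
print(df.loc[:, 'first name'].tolist())  # ['Andrew', 'Barbara']
```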


# 10. What is .iloc and When is It Used?

The `.iloc` indexer selects rows and columns by their integer position, counting from 0. It is particularly useful when the index labels are not known, or when you need an item by position regardless of its label.

In [18]:
# Using .iloc
print(df.iloc[0])  # Selects the first row
print(df.iloc[:, 0])  # Selects the first column


A            1
B            3
Name    Andrew
Name: 0, dtype: object
0    1
1    2
Name: A, dtype: int64
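With the default index `0, 1, ...`, positions and labels coincide, so the difference between `.iloc` and `.loc` is easy to miss. The sketch below uses a hypothetical DataFrame with the row labels 10, 20, 30 to separate the two; it also shows that `.iloc` slices exclude the end position while `.loc` slices include the end label.

```python
import pandas as pd

# Hypothetical DataFrame whose row labels differ from the row positions
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=[10, 20, 30])

# Position 0 and label 10 refer to the same row
print(df.iloc[0].equals(df.loc[10]))   # True

# Position slice: end is exclusive (positions 0 and 1)
print(df.iloc[0:2].index.tolist())     # [10, 20]

# Label slice: end is inclusive (labels 10 through 20)
print(df.loc[10:20].index.tolist())    # [10, 20]

# Negative positions count from the end: last row, second column
print(df.iloc[-1, 1])                  # 6
```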
