# A Short Introduction to Pandas

Pandas is one of the most widely used libraries for handling and manipulating structured data, such as CSV files, Excel sheets, or SQL databases. It provides efficient data structures like DataFrame and Series for handling large datasets, making it easy to clean, manipulate, and analyze tabular data.

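PyTorch does not operate on Pandas DataFrames directly, but the two libraries are commonly used together: Pandas prepares and cleans the data, and PyTorch consumes it once it has been converted to tensors.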

Loading Data: Datasets are often loaded into a Pandas DataFrame from various sources (CSV, Excel, SQL, etc.) for easier manipulation and cleaning before converting them into a format compatible with PyTorch. For example, tabular data may need to be converted into tensors before training a neural network.
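
As a rough sketch of that conversion step, the snippet below builds a small, made-up DataFrame and extracts its numeric columns as a float32 NumPy array. The column names and values are purely illustrative. Assuming PyTorch is installed, an array like this could then be passed to `torch.from_numpy` to obtain a tensor; that last step is left out so the example depends only on pandas and NumPy.

```python
import numpy as np
import pandas as pd

# Hypothetical tabular data; names and values are illustrative only
df = pd.DataFrame({
    "Age": [25, 30, 35, 40],
    "Salary": [70000, 80000, 120000, 95000],
})

# Extract the numeric columns as a float32 array,
# a dtype commonly used for neural network inputs
features = df[["Age", "Salary"]].to_numpy(dtype=np.float32)

print(features.shape)
print(features.dtype)
```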


Ensure you have pandas installed. If not, install it via pip:


In [1]:
%pip install pandas

# Or, if you are using conda:
# conda install pandas


To use pandas, import it in your Python script or Jupyter notebook:

In [2]:
import pandas as pd

A Series is a one-dimensional array-like object that can hold any data type:

In [4]:
# Creating a simple series
data = [1, 2, 3, 4, 5]
s = pd.Series(data)

# Display the series
print(s)

# pd.Series() creates a pandas Series.
# A Series is like a column in a table, but it can also be used as a one-dimensional array.

0    1
1    2
2    3
3    4
4    5
dtype: int64
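
The left-hand column in the output above is the index. As a small sketch (the names and ages are made up for illustration), a Series can also be given custom labels, which then work for lookups much like dictionary keys:

```python
import pandas as pd

# Hypothetical ages keyed by name; the labels replace the default 0..n-1 index
ages = pd.Series([25, 30, 35], index=["Alice", "Bob", "Charlie"])

# Look up a single value by its label
bob_age = ages["Bob"]
print(bob_age)

# Aggregations such as mean() operate on all values at once
mean_age = ages.mean()
print(mean_age)
```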


A DataFrame is a two-dimensional table, similar to a spreadsheet or SQL table:

In [5]:
# Creating a DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']
}
df = pd.DataFrame(data)

# Display the DataFrame
print(df)

# pd.DataFrame() creates a DataFrame from a dictionary.
# A DataFrame consists of rows and columns.

      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
3    David   40      Houston
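
Before working with a DataFrame it often helps to check its size and column types. The following is a minimal sketch that rebuilds the same example table and inspects it with the `shape`, `columns`, and `dtypes` attributes:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [25, 30, 35, 40],
    "City": ["New York", "Los Angeles", "Chicago", "Houston"],
})

# shape is a (rows, columns) tuple
n_rows, n_cols = df.shape
print(n_rows, n_cols)

# columns holds the column labels
column_names = list(df.columns)
print(column_names)

# dtypes lists the data type pandas inferred for each column
print(df.dtypes)
```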


Pandas makes it easy to load data from external files, such as CSVs:

In [35]:
# Reading data from a CSV file
df = pd.read_csv('data.csv')

# Display the first few rows of the DataFrame
print(df.head())

# pd.read_csv() reads a CSV file into a DataFrame.
# df.head() shows the first 5 rows by default; this file has only 4 rows, so all of them are shown.

      Name  Age        City   Salary       Degree Height  Weight
0    Alice   30    New York   100000           CS    6'5     102
1      Bob   15  Providence  2000000  Art History    4'5     103
2  Charile   45     Chicago    50000         Math    3'7     104
3    David   18      Boston    10000      Biology    7'5     105
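
The cell above needs a `data.csv` file in the working directory. To try `read_csv` without one, a possible approach is to read from an in-memory text buffer instead. The CSV content below is a made-up stand-in, not the actual file used in this notebook:

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = """Name,Age,City
Alice,30,New York
Bob,15,Providence
David,18,Boston
"""

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

# head(n) returns at most the first n rows
first_two = df.head(2)
print(first_two)
```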


Pandas provides many ways to manipulate your data. Here are some of the most common ones:

In [34]:
# Select the 'Name' column
print(df['Name'])

0      Alice
1        Bob
2    Charile
3      David
Name: Name, dtype: object
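
Selecting with a single column name returns a Series, as shown above. Passing a list of names instead returns a DataFrame containing only those columns. A short sketch with an illustrative table:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [30, 15, 45],
    "City": ["New York", "Providence", "Chicago"],
})

# A single name gives a Series
name_column = df["Name"]
print(type(name_column).__name__)

# A list of names gives a DataFrame with just those columns
subset = df[["Name", "Age"]]
print(type(subset).__name__)
```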


In [36]:
# Filter rows where Age is greater than 30
filtered_df = df[df['Age'] > 30]
print(filtered_df)

      Name  Age     City  Salary Degree Height  Weight
2  Charile   45  Chicago   50000   Math    3'7     104
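
Filters can also combine several conditions. The sketch below uses made-up rows; note that each condition needs its own parentheses, and that `&` and `|` are used rather than Python's `and` and `or`:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [30, 15, 45, 18],
    "City": ["New York", "Providence", "Chicago", "Boston"],
})

# Rows where both conditions hold
adults_outside_chicago = df[(df["Age"] >= 18) & (df["City"] != "Chicago")]
print(adults_outside_chicago)

# .loc filters rows and selects a column in one step
names_of_minors = df.loc[df["Age"] < 18, "Name"]
print(names_of_minors.tolist())
```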


In [37]:
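# Assign a list to a column. 'Salary' already exists here, so its values are overwritten;
# assigning to a name that does not exist yet would add a new column.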
df['Salary'] = [70000, 80000, 120000, 95000]
print(df)

      Name  Age        City  Salary       Degree Height  Weight
0    Alice   30    New York   70000           CS    6'5     102
1      Bob   15  Providence   80000  Art History    4'5     103
2  Charile   45     Chicago  120000         Math    3'7     104
3    David   18      Boston   95000      Biology    7'5     105
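
Columns can also be computed from existing ones. The following sketch, with hypothetical column names, adds a derived column in place and then uses `assign()`, which returns a new DataFrame instead of modifying the original:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob"],
    "Salary": [70000, 80000],
})

# Assigning to a name that does not exist yet creates a new column
df["SalaryInThousands"] = df["Salary"] / 1000

# assign() returns a new DataFrame and leaves df unchanged
with_bonus = df.assign(Bonus=df["Salary"] // 10)

print(df)
print(with_bonus)
```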


In [38]:
# Drop the 'City' column
df = df.drop(columns=['City'])
print(df)

      Name  Age  Salary       Degree Height  Weight
0    Alice   30   70000           CS    6'5     102
1      Bob   15   80000  Art History    4'5     103
2  Charile   45  120000         Math    3'7     104
3    David   18   95000      Biology    7'5     105
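
Note that `drop()` returns a new DataFrame, which is why the cell above assigns the result back to `df`. A small sketch on illustrative data shows that the original is left untouched, and that rows can be dropped by index label in the same way:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [30, 15, 45],
    "City": ["New York", "Providence", "Chicago"],
})

# drop() returns a new DataFrame; df still has all three columns
without_city = df.drop(columns=["City"])
print(list(df.columns))
print(list(without_city.columns))

# Rows are dropped by their index label
without_first_row = df.drop(index=[0])
print(without_first_row)
```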


Pandas allows you to group and summarize your data. In the example below every salary is unique, so each group contains a single row:

In [45]:
# Group by 'Salary' and calculate the average 'Age'
grouped_df = df.groupby('Salary')['Age'].mean()
print(grouped_df)

Salary
70000     30.0
80000     15.0
95000     18.0
120000    45.0
Name: Age, dtype: float64
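
Grouping is easier to appreciate when the key column has repeated values. The sketch below uses a made-up table in which several rows share a degree, then computes one or more summaries per group:

```python
import pandas as pd

# Hypothetical data with repeated values in 'Degree'
df = pd.DataFrame({
    "Degree": ["CS", "Math", "CS", "Math", "Biology"],
    "Age": [30, 45, 20, 35, 18],
    "Salary": [100000, 50000, 80000, 70000, 10000],
})

# Average age within each degree
mean_age = df.groupby("Degree")["Age"].mean()
print(mean_age)

# Several summaries at once, using named aggregation
summary = df.groupby("Degree").agg(
    mean_age=("Age", "mean"),
    max_salary=("Salary", "max"),
)
print(summary)
```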


After manipulating your data, you can save it to a CSV file:

In [47]:
# Save the DataFrame to a CSV file
df.to_csv('output.csv', index=False)

# df.to_csv() writes the DataFrame to a CSV file.
# index=False ensures that the index is not saved to the file.
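
To see what `to_csv` produces without writing a file, it can be called with no path, in which case it returns the CSV text as a string. The sketch below, on illustrative data, checks the header line and reads the text back with `read_csv`:

```python
import io

import pandas as pd

df = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [30, 15]})

# With no path argument, to_csv returns the CSV text instead of writing a file
csv_text = df.to_csv(index=False)
header = csv_text.splitlines()[0]
print(header)

# Reading the text back recovers the same columns and values
restored = pd.read_csv(io.StringIO(csv_text))
print(restored)
```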