## What Is Pandas & How to Import It

Pandas is an open-source data analysis and manipulation tool built on top of the Python programming language.

It's widely used for data science and machine learning tasks.

To import Pandas, you can use:

In [1]:
import pandas as pd

## Series: A one-dimensional labeled array capable of holding any data type.



In [4]:
import pandas as pd
import numpy as np

s = pd.Series([1, 3, 5, np.nan, 6, 8])  # np.nan marks a missing value; the integers are upcast to float64 to accommodate it
print(s)

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64


## DataFrame: A two-dimensional labeled data structure with columns of potentially different types.

In [6]:
data = {'A': [1, 2, 3],
        'B': [4, 5, 6]}
df = pd.DataFrame(data)
print(df)

   A  B
0  1  4
1  2  5
2  3  6


## Index: The labels along each axis: row labels for a Series, and row and column labels for a DataFrame.



In [7]:
index = pd.Index(['a', 'b', 'c'])
print(index)

Index(['a', 'b', 'c'], dtype='object')
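
An Index is most useful once it is attached to data. The following is a minimal sketch, with made-up values, of how the same labels can be used to look up entries in a Series, and how that differs from position-based access with `.iloc`.

```python
import pandas as pd

index = pd.Index(['a', 'b', 'c'])

# Attach the labels to some illustrative values (the numbers are arbitrary)
s = pd.Series([10, 20, 30], index=index)

by_label = s['b']         # label-based lookup through the Index
by_position = s.iloc[1]   # position-based lookup, ignores the labels

print(by_label, by_position)
```

Both lookups reach the same element here because label 'b' happens to sit at position 1.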


## Creating DataFrames

In [8]:
data = {'Name': ['Tom', 'Jerry', 'Mickey'], 'Age': [20, 21, 19]}
df = pd.DataFrame(data)
print(df)

     Name  Age
0     Tom   20
1   Jerry   21
2  Mickey   19


## Accessing Data in DataFrames: Filtering & Selecting Data

In [11]:
# Selecting a column
ages = df['Age']
print(ages)

0    20
1    21
2    19
Name: Age, dtype: int64


In [10]:
# Filtering rows
adults = df[df['Age'] > 20]
print(adults)

    Name  Age
1  Jerry   21
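
Filters can also be combined. The sketch below rebuilds the same small table so that it runs on its own, combines two boolean masks with `&`, and uses `.loc` to pick rows and a column in one step. The variable names `mask` and `names` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Tom', 'Jerry', 'Mickey'], 'Age': [20, 21, 19]})

# Wrap each comparison in parentheses before combining with & (and) or | (or)
mask = (df['Age'] >= 20) & (df['Name'] != 'Tom')

# .loc takes a row selector first and a column selector second
names = df.loc[mask, 'Name']
print(names)
```

Python's `and`/`or` keywords do not work element-wise on a Series, which is why the `&` and `|` operators are used instead.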


## Adding & Removing Columns

In [19]:
df['Gender'] = ['M', 'M', 'M']  # Adding a column
print(df)

     Name  Age Gender
0     Tom   20      M
1   Jerry   21      M
2  Mickey   19      M


In [20]:
df.drop('Gender', axis=1, inplace=True)  # Removing a column
print(df)


     Name  Age
0     Tom   20
1   Jerry   21
2  Mickey   19


## Merging & Joining DataFrames

In [21]:
df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'value': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['B', 'C', 'D'], 'value': [4, 5, 6]})
merged_df = pd.merge(df1, df2, on='key', how='inner')  # Inner join
print(merged_df)

  key  value_x  value_y
0   B        2        4
1   C        3        5


## Grouping and Aggregating Data

In [22]:
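# Group rows by Age and sum the remaining columns; for the string column Name,
# summing concatenates the names in each group (each group has one row here)
grouped = df.groupby('Age').sum()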
print(grouped)

       Name
Age        
19   Mickey
20      Tom
21    Jerry
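
Grouping is easier to see when groups hold more than one row and the aggregated column is numeric. The sketch below uses a hypothetical `sales` table (not part of the examples above) and computes several aggregates at once with `.agg`.

```python
import pandas as pd

# Hypothetical data for illustration: several rows per team
sales = pd.DataFrame({
    'Team': ['red', 'red', 'blue', 'blue', 'blue'],
    'Points': [10, 20, 5, 15, 10],
})

# Select the column to aggregate, then apply several functions per group
summary = sales.groupby('Team')['Points'].agg(['sum', 'mean', 'count'])
print(summary)
```

The result has one row per team and one column per aggregation function.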


## Pivot Tables

In [26]:
# Here, we are creating a DataFrame df with four columns: A, B, C, and D. The DataFrame looks like this:

#      A    B      C  D
# 0  foo  one  small  1
# 1  foo  one  large  2
# 2  foo  two  large  2
# 3  bar  two  small  3
# 4  bar  one  small  3
# 5  bar  one  large  4 

In [28]:
# Here's what each part of the pivot_table method does:

# values='D': We are interested in the values from the D column.
# index=['A', 'B']: We want to create a multi-level index using columns A and B.
# columns=['C']: The values in the C column will become the columns in the pivot table.
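# aggfunc=np.sum: The aggregation function sums the values that fall into each cell. Recent pandas versions emit a FutureWarning for np.sum here and recommend the string 'sum', which gives the same result.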

In [29]:
# Explanation

# The rows of the pivot table are indexed by the combinations of A and B.
# The columns of the pivot table are created from the unique values in C.
# The values in the pivot table are the sum of D for each combination of A, B, and C.

# For example:

# For A='foo' and B='one', the D values for C='large' and C='small' are 2 and 1 respectively. So, the table has 2.0 and 1.0 in the corresponding cells.
# For A='bar' and B='one', the D values for C='large' and C='small' are 4 and 3 respectively. So, the table has 4.0 and 3.0 in the corresponding cells.
# For combinations where no data exists (e.g., A='bar' and B='two' with C='large'), the result is NaN.
# The pivot table provides a way to see the summarized data in a structured and easy-to-read format.

In [24]:
df = pd.DataFrame({
    'A': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'],
    'B': ['one', 'one', 'two', 'two', 'one', 'one'],
    'C': ['small', 'large', 'large', 'small', 'small', 'large'],
    'D': [1, 2, 2, 3, 3, 4]
})
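# aggfunc='sum' gives the same result as np.sum and avoids the FutureWarning raised by recent pandas versions
pivot_table = df.pivot_table(values='D', index=['A', 'B'], columns=['C'], aggfunc='sum')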
print(pivot_table)

C        large  small
A   B                
bar one    4.0    3.0
    two    NaN    3.0
foo one    2.0    1.0
    two    2.0    NaN
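
Combinations with no data show up as NaN, which also forces the values to floats. If a zero is a sensible stand-in for "no data" in your case (an assumption that depends on the data), `pivot_table` accepts a `fill_value` argument. A minimal sketch on the same table:

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'],
    'B': ['one', 'one', 'two', 'two', 'one', 'one'],
    'C': ['small', 'large', 'large', 'small', 'small', 'large'],
    'D': [1, 2, 2, 3, 3, 4]
})

# fill_value replaces the missing (A, B, C) combinations with 0 instead of NaN
filled = df.pivot_table(values='D', index=['A', 'B'], columns='C',
                        aggfunc='sum', fill_value=0)
print(filled)
```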




## Reading & Writing CSV Files    

In [31]:
df.to_csv('data.csv', index=False)  # Writing to a CSV file
new_df = pd.read_csv('data.csv')    # Reading from a CSV file
print(new_df)

     A    B      C  D
0  foo  one  small  1
1  foo  one  large  2
2  foo  two  large  2
3  bar  two  small  3
4  bar  one  small  3
5  bar  one  large  4


## Reading & Writing Excel Files

In [35]:
import pandas as pd
import numpy as np

df.to_excel('data.xlsx', index=False)  # Writing to an Excel file
new_df = pd.read_excel('data.xlsx')    # Reading from an Excel file
print(new_df)

     A    B      C  D
0  foo  one  small  1
1  foo  one  large  2
2  foo  two  large  2
3  bar  two  small  3
4  bar  one  small  3
5  bar  one  large  4


## Reading & Writing SQL Databases

In [38]:
# from sqlalchemy import create_engine

# engine = create_engine('sqlite:///:memory:')
# df.to_sql('data', engine)  # Writing to SQL
# new_df = pd.read_sql('data', con=engine)  # Reading from SQL
# print(new_df)
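
If SQLAlchemy is not available, a similar round trip can be sketched with Python's built-in `sqlite3` module, which pandas accepts as a connection for SQLite. This is an alternative to the commented-out snippet, not a replacement for it. The small DataFrame and the table name `data` are illustrative, and with a plain `sqlite3` connection the read step takes a SQL query rather than a bare table name.

```python
import sqlite3

import pandas as pd

# Small illustrative frame so the example runs on its own
df = pd.DataFrame({'A': ['foo', 'bar'], 'D': [1, 3]})

conn = sqlite3.connect(':memory:')      # temporary in-memory database
df.to_sql('data', conn, index=False)    # Writing to SQL
new_df = pd.read_sql('SELECT * FROM data', conn)  # Reading from SQL
conn.close()

print(new_df)
```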

## Reading & Writing JSON Files

In [39]:
df.to_json('data.json')  # Writing to a JSON file
new_df = pd.read_json('data.json')  # Reading from a JSON file
print(new_df)


     A    B      C  D
0  foo  one  small  1
1  foo  one  large  2
2  foo  two  large  2
3  bar  two  small  3
4  bar  one  small  3
5  bar  one  large  4


## Reading & Writing Text & Binary Files



In [42]:
df.to_csv('data.txt', index=False, sep='\t')  # Writing to a text file
new_df = pd.read_csv('data.txt', sep='\t')  # Reading from a text file
print(new_df)

     A    B      C  D
0  foo  one  small  1
1  foo  one  large  2
2  foo  two  large  2
3  bar  two  small  3
4  bar  one  small  3
5  bar  one  large  4


In [41]:

df.to_pickle('data.pkl')  # Writing to a binary file
new_df = pd.read_pickle('data.pkl')  # Reading from a binary file
print(new_df)

     A    B      C  D
0  foo  one  small  1
1  foo  one  large  2
2  foo  two  large  2
3  bar  two  small  3
4  bar  one  small  3
5  bar  one  large  4


## Edge Cases

## Empty DataFrames: Handle scenarios where your DataFrame might be empty after filtering.

In [45]:
data = {'Name': ['Tom', 'Jerry', 'Mickey'], 'Age': [20, 21, 19]}
df = pd.DataFrame(data)
print(df)

     Name  Age
0     Tom   20
1   Jerry   21
2  Mickey   19


In [46]:
empty_df = df[df['Age'] > 100]
if empty_df.empty:
    print("DataFrame is empty")

DataFrame is empty
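
An empty result still carries its column structure, so code that follows a filter usually keeps running but may produce NaN or nothing at all. The sketch below shows one possible guard; returning `None` for "no data" is an assumption for illustration, not a pandas convention.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Tom', 'Jerry', 'Mickey'], 'Age': [20, 21, 19]})
empty_df = df[df['Age'] > 100]

# Zero rows, but both columns are still present
print(empty_df.shape)

# Guard an aggregation so callers get an explicit "no data" marker
average_age = empty_df['Age'].mean() if not empty_df.empty else None
print(average_age)
```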


## Missing Data: Manage missing data using methods like .fillna() or .dropna().



In [47]:
df['Age'] = df['Age'].fillna(df['Age'].mean())  # Fill missing ages with the column mean
print(df)

     Name  Age
0     Tom   20
1   Jerry   21
2  Mickey   19


This sample has no missing ages, so the output is unchanged.

Note: the chained form df['Age'].fillna(df['Age'].mean(), inplace=True) triggers a FutureWarning. Its behavior will change in pandas 3.0, where this inplace call will never work because the intermediate object on which the values are set always behaves as a copy.

To perform the operation on the original object, use df[col] = df[col].method(value), as above, or df.method({col: value}, inplace=True).
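
Because the table above contains no missing values, the effect of `.fillna()` and `.dropna()` is not visible there. The sketch below uses a variant of the table with one age deliberately set to NaN (an assumption made for illustration) and shows both approaches without modifying the original frame.

```python
import numpy as np
import pandas as pd

# Same names as above, but Jerry's age is deliberately missing
df = pd.DataFrame({'Name': ['Tom', 'Jerry', 'Mickey'], 'Age': [20, np.nan, 19]})

# Option 1: fill the gap with the mean of the known ages (mean skips NaN)
filled = df.fillna({'Age': df['Age'].mean()})

# Option 2: drop any row whose Age is missing
dropped = df.dropna(subset=['Age'])

print(filled)
print(dropped)
```

Both calls return new DataFrames, so `df` itself still contains the NaN afterwards.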


## Duplicate Data: Remove duplicate rows using .drop_duplicates().



In [48]:
df.drop_duplicates(inplace=True)
print(df)

     Name  Age
0     Tom   20
1   Jerry   21
2  Mickey   19
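
The table above has no duplicates, so `.drop_duplicates()` leaves it unchanged. The following sketch uses a made-up table with repeated rows to show what is removed, and how the `subset` and `keep` parameters narrow the comparison to specific columns.

```python
import pandas as pd

# Illustrative data: row 2 repeats row 0 exactly; row 3 repeats only the name
df = pd.DataFrame({'Name': ['Tom', 'Jerry', 'Tom', 'Tom'],
                   'Age': [20, 21, 20, 22]})

# duplicated() flags rows that repeat an earlier row in every column
print(df.duplicated())

# Drop rows that are identical across all columns
unique_rows = df.drop_duplicates()

# Compare on Name only, keeping the last occurrence of each name
unique_names = df.drop_duplicates(subset='Name', keep='last')

print(unique_rows)
print(unique_names)
```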
