##### Pandas - Data Frame and Series

Pandas is a powerful data manipulation library for Python, widely used for data analysis and data cleaning. It provides two primary data structures: Series and DataFrame. A Series is a one-dimensional, array-like object with an index, while a DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns).

In [3]:
import pandas as pd

In [None]:
## Series

## A Pandas Series is a one-dimensional, array-like object that can hold any data type. It is similar to a single column in a table

data = [1,2,3,4,5,6]
series = pd.Series(data)
print(series)
print(type(series))

'''
0    1
1    2
2    3
3    4
4    5
5    6
dtype: int64
<class 'pandas.core.series.Series'>
'''

In [None]:
## Create a Series from a dictionary (the keys become the index labels)

data = {'a': 1, 'b': 2, 'c': 3}
series_dict = pd.Series(data)
print(series_dict)

'''
a    1
b    2
c    3
dtype: int64
'''

In [None]:
## Create a Series with a custom index

data = [10, 20, 30]
index = ['a', 'b', 'c']
pd.Series(data, index=index)

'''
a    10
b    20
c    30
dtype: int64
'''
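Once a Series has labels, its values can be read either by label or by position. The following is a minimal sketch of that distinction, reusing the values from the cell above; the variable names are illustrative only.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

by_label = s.loc['b']        # label-based lookup
by_position = s.iloc[1]      # position-based lookup (0-indexed)
subset = s.loc[['a', 'c']]   # a list of labels returns a smaller Series

print(by_label, by_position)
print(subset.tolist())
```

Using `.loc` and `.iloc` explicitly avoids ambiguity about whether a key is meant as a label or as a position.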

In [None]:
## Dataframe

## Create a DataFrame from a dictionary of lists

data={
    'Name':['Krish','John','Jack'],
    'Age':[25,30,45],
    'City':['Bangalore','New York','Florida']
}

df = pd.DataFrame(data)
print(df)
print(type(df))

'''
    Name  Age       City
0  Krish   25  Bangalore
1   John   30   New York
2   Jack   45    Florida
<class 'pandas.core.frame.DataFrame'>
'''

In [None]:
## We can also use NumPy to convert this DataFrame to a 2D array

import numpy as np

arr = np.array(df)
print(arr)

'''
[['Krish' 25 'Bangalore']
 ['John' 30 'New York']
 ['Jack' 45 'Florida']]
'''
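As an alternative to `np.array(df)`, pandas offers the `to_numpy()` method on both DataFrame and Series. The sketch below uses a smaller, illustrative frame and shows that mixing strings and integers yields an array of generic Python objects, whereas a single numeric column keeps a numeric dtype.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Krish', 'John'], 'Age': [25, 30]})

arr = df.to_numpy()           # whole frame: mixed columns fall back to dtype=object
ages = df['Age'].to_numpy()   # single numeric column: integer dtype is preserved

print(arr.shape, arr.dtype)
print(ages.dtype)
```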

In [None]:
## Create a DataFrame from a list of dictionaries

data=[
    {'Name':'Krish','Age':32,'City':'Bangalore'},
    {'Name':'John','Age':34,'City':'Bangalore'},
    {'Name':'Bappy','Age':32,'City':'Bangalore'},
    {'Name':'JAck','Age':32,'City':'Bangalore'}
    
]

df = pd.DataFrame(data)
print(df)

'''
    Name  Age       City
0  Krish   32  Bangalore
1   John   34  Bangalore
2  Bappy   32  Bangalore
3   JAck   32  Bangalore
'''
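When the dictionaries in the list do not all share the same keys, pandas still builds the frame and fills the gaps with missing values. The following sketch assumes a hypothetical second record with no `City` key.

```python
import pandas as pd

records = [
    {'Name': 'Krish', 'Age': 32, 'City': 'Bangalore'},
    {'Name': 'John', 'Age': 34},   # hypothetical record without a 'City' key
]

df_missing = pd.DataFrame(records)

print(df_missing)
print(df_missing['City'].isna().tolist())   # marks which rows lack a City
```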

In [None]:
## Consider sales_data.csv

df = pd.read_csv('sales_data.csv')
df.head(5)  # Show the first 5 rows of the DataFrame

Unnamed: 0,Transaction ID,Date,Product Category,Product Name,Units Sold,Unit Price,Total Revenue,Region,Payment Method
0,10001,2024-01-01,Electronics,iPhone 14 Pro,2,999.99,1999.98,North America,Credit Card
1,10002,2024-01-02,Home Appliances,Dyson V11 Vacuum,1,499.99,499.99,Europe,PayPal
2,10003,2024-01-03,Clothing,Levi's 501 Jeans,3,69.99,209.97,Asia,Debit Card
3,10004,2024-01-04,Books,The Da Vinci Code,4,15.99,63.96,North America,Credit Card
4,10005,2024-01-05,Beauty Products,Neutrogena Skincare Set,1,89.99,89.99,Europe,PayPal
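Since `sales_data.csv` is not included here, the sketch below reads a few of the rows shown above from an in-memory string instead. The reduced column set and the use of `parse_dates` are assumptions made for illustration, not part of the original notebook.

```python
import io
import pandas as pd

# A small excerpt in CSV form, standing in for the real file
csv_text = """Transaction ID,Date,Units Sold,Unit Price
10001,2024-01-01,2,999.99
10002,2024-01-02,1,499.99
10003,2024-01-03,3,69.99
"""

# parse_dates converts the Date column from strings to datetimes
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=['Date'])

print(sample.head(2))
print(sample.dtypes)
```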


In [17]:
df.tail(5)

Unnamed: 0,Transaction ID,Date,Product Category,Product Name,Units Sold,Unit Price,Total Revenue,Region,Payment Method
235,10236,2024-08-23,Home Appliances,Nespresso Vertuo Next Coffee and Espresso Maker,1,159.99,159.99,Europe,PayPal
236,10237,2024-08-24,Clothing,Nike Air Force 1 Sneakers,3,90.0,270.0,Asia,Debit Card
237,10238,2024-08-25,Books,The Handmaid's Tale by Margaret Atwood,3,10.99,32.97,North America,Credit Card
238,10239,2024-08-26,Beauty Products,Sunday Riley Luna Sleeping Night Oil,1,55.0,55.0,Europe,PayPal
239,10240,2024-08-27,Sports,Yeti Rambler 20 oz Tumbler,2,29.99,59.98,Asia,Credit Card


In [None]:
data={
    'Name':['Krish','John','Jack'],
    'Age':[25,30,45],
    'City':['Bangalore','New York','Florida']
}

df = pd.DataFrame(data)
df


Unnamed: 0,Name,Age,City
0,Krish,25,Bangalore
1,John,30,New York
2,Jack,45,Florida


In [None]:
## Accessing Data from DataFrame

df['Name']

'''
0    Krish
1     John
2     Jack
Name: Name, dtype: object
'''

In [None]:
type(df['Name'])  # pandas.core.series.Series -> Selecting a single column from a DataFrame returns a Series
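Whether the result is a Series or a DataFrame depends on the brackets used. A short sketch of the difference, using the same data as above:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Krish', 'John', 'Jack'],
    'Age': [25, 30, 45],
    'City': ['Bangalore', 'New York', 'Florida'],
})

col = df['Name']              # single brackets -> Series
frame = df[['Name']]          # double brackets -> one-column DataFrame
two = df[['Name', 'City']]    # a list of names selects several columns

print(type(col).__name__)
print(type(frame).__name__)
print(two.shape)
```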

In [None]:
## Access a row by its index label

df.loc[0]

'''
Name        Krish
Age            25
City    Bangalore
Name: 0, dtype: object
'''
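With the default integer index, labels and positions happen to coincide, which can hide the difference between `loc` (label-based) and `iloc` (position-based). The sketch below assumes a hypothetical string index to make the distinction visible.

```python
import pandas as pd

df = pd.DataFrame(
    {'Name': ['Krish', 'John', 'Jack'], 'Age': [25, 30, 45]},
    index=['x', 'y', 'z'],   # hypothetical labels, for illustration only
)

row_by_label = df.loc['y']      # looks up the label 'y'
row_by_position = df.iloc[1]    # looks up the second row

sliced_loc = df.loc['x':'y']    # label slices include the end label
sliced_iloc = df.iloc[0:1]      # position slices exclude the end position

print(row_by_label.equals(row_by_position))
print(len(sliced_loc), len(sliced_iloc))
```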

In [31]:
df

Unnamed: 0,Name,Age,City
0,Krish,25,Bangalore
1,John,30,New York
2,Jack,45,Florida


In [None]:
## Accessing a particular element
print(df.loc[0, 'Name'])  # Krish (df.loc[0][0] relies on positional Series indexing, which is deprecated in recent pandas)

In [None]:
print(df.at[1,'City'])  # New York
print(df.at[2,'City'])  # Florida
print(df.at[0,'Age'])  # 25

In [None]:
## Accessing a specified element using iat

df.iat[2,2]  # 'Florida'
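Both `at` (label-based) and `iat` (position-based) can also be used to assign a single value. A minimal sketch follows; the replacement values are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Krish', 'John', 'Jack'],
    'Age': [25, 30, 45],
    'City': ['Bangalore', 'New York', 'Florida'],
})

df.at[0, 'Age'] = 26       # row label 0, column label 'Age'
df.iat[2, 2] = 'Miami'     # third row, third column (hypothetical new value)

print(df.at[0, 'Age'], df.iat[2, 2])
```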

In [None]:
df

Unnamed: 0,Name,Age,City
0,Krish,25,Bangalore
1,John,30,New York
2,Jack,45,Florida


In [47]:
## Data Manipulation with Data Frames

## Adding a new column

df['Salary'] = [50000, 60000, 70000]
df

Unnamed: 0,Name,Age,City,Salary
0,Krish,25,Bangalore,50000
1,John,30,New York,60000
2,Jack,45,Florida,70000


In [None]:
## Removing a column
df.drop('Salary')  # Error -> KeyError: "['Salary'] not found in axis" (drop searches the row labels by default)

In [None]:
'''
In pandas, the axis parameter tells pandas whether you’re working along rows or columns:

axis=0 → operate along the rows (this is the default).

Example: df.drop(3) → drops the row with index 3.

axis=1 → operate along the columns.

Example: df.drop('Salary', axis=1) → drops the column named "Salary".
'''
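The `drop` method also accepts `columns=` and `index=` keywords, which state the intent directly instead of relying on an `axis` number. A sketch showing that both spellings give the same result:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Krish', 'John', 'Jack'],
    'Age': [25, 30, 45],
    'Salary': [50000, 60000, 70000],
})

no_salary = df.drop(columns='Salary')   # same as df.drop('Salary', axis=1)
no_row_1 = df.drop(index=1)             # same as df.drop(1, axis=0)

print(list(no_salary.columns))
print(list(no_row_1.index))
```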

In [50]:
df.drop('Salary', axis=1)

Unnamed: 0,Name,Age,City
0,Krish,25,Bangalore
1,John,30,New York
2,Jack,45,Florida


In [None]:
df  # This still shows the Salary column because drop() returns a new DataFrame and leaves the original unchanged.

Unnamed: 0,Name,Age,City,Salary
0,Krish,25,Bangalore,50000
1,John,30,New York,60000
2,Jack,45,Florida,70000


In [None]:
## To apply the drop to the DataFrame itself, pass inplace=True

df.drop('Salary', axis=1, inplace=True)
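An alternative to `inplace=True` is to assign the result back to the same name. The sketch below also uses `errors='ignore'`, which may be handy in a notebook where a cell can be run more than once and the column may already be gone.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Krish', 'John', 'Jack'],
    'Age': [25, 30, 45],
    'Salary': [50000, 60000, 70000],
})

# Reassign instead of mutating in place
df = df.drop(columns='Salary')

# Running the drop again would normally raise a KeyError;
# errors='ignore' skips labels that are not present
df = df.drop(columns='Salary', errors='ignore')

print(list(df.columns))
```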

In [54]:
df

Unnamed: 0,Name,Age,City
0,Krish,25,Bangalore
1,John,30,New York
2,Jack,45,Florida


In [55]:
## To drop a row
df.drop(1)

Unnamed: 0,Name,Age,City
0,Krish,25,Bangalore
2,Jack,45,Florida


In [None]:
df  ## Row 1 is still present because drop() returned a new DataFrame without modifying df

Unnamed: 0,Name,Age,City
0,Krish,25,Bangalore
1,John,30,New York
2,Jack,45,Florida


In [None]:
## Increase every value in the Age column by 1
df['Age'] = df['Age'] + 1

In [58]:
df

Unnamed: 0,Name,Age,City
0,Krish,26,Bangalore
1,John,31,New York
2,Jack,46,Florida
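Arithmetic and comparisons on a column are applied element by element, so the same pattern can be used to derive new columns. In the sketch below, the column names `Age in 5 Years` and `Is Over 30` are hypothetical and chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Krish', 'John', 'Jack'], 'Age': [25, 30, 45]})

df['Age'] = df['Age'] + 1                 # applied to every row at once
df['Age in 5 Years'] = df['Age'] + 5      # new column derived from an existing one
df['Is Over 30'] = df['Age'] > 30         # comparisons produce a boolean column

print(df)
```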


In [59]:
df = pd.read_csv('sales_data.csv')
df

Unnamed: 0,Transaction ID,Date,Product Category,Product Name,Units Sold,Unit Price,Total Revenue,Region,Payment Method
0,10001,2024-01-01,Electronics,iPhone 14 Pro,2,999.99,1999.98,North America,Credit Card
1,10002,2024-01-02,Home Appliances,Dyson V11 Vacuum,1,499.99,499.99,Europe,PayPal
2,10003,2024-01-03,Clothing,Levi's 501 Jeans,3,69.99,209.97,Asia,Debit Card
3,10004,2024-01-04,Books,The Da Vinci Code,4,15.99,63.96,North America,Credit Card
4,10005,2024-01-05,Beauty Products,Neutrogena Skincare Set,1,89.99,89.99,Europe,PayPal
...,...,...,...,...,...,...,...,...,...
235,10236,2024-08-23,Home Appliances,Nespresso Vertuo Next Coffee and Espresso Maker,1,159.99,159.99,Europe,PayPal
236,10237,2024-08-24,Clothing,Nike Air Force 1 Sneakers,3,90.00,270.00,Asia,Debit Card
237,10238,2024-08-25,Books,The Handmaid's Tale by Margaret Atwood,3,10.99,32.97,North America,Credit Card
238,10239,2024-08-26,Beauty Products,Sunday Riley Luna Sleeping Night Oil,1,55.00,55.00,Europe,PayPal


In [60]:
## Let's perform a statistical analysis on this DataFrame

df.describe()

Unnamed: 0,Transaction ID,Units Sold,Unit Price,Total Revenue
count,240.0,240.0,240.0,240.0
mean,10120.5,2.158333,236.395583,335.699375
std,69.42622,1.322454,429.446695,485.804469
min,10001.0,1.0,6.5,6.5
25%,10060.75,1.0,29.5,62.965
50%,10120.5,2.0,89.99,179.97
75%,10180.25,3.0,249.99,399.225
max,10240.0,10.0,3899.99,3899.99
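The result of `describe()` is itself a DataFrame, so individual statistics can be looked up by row and column label. The sketch below uses only the five `Units Sold` and `Unit Price` values from the first rows shown earlier, so its numbers differ from the full-file summary above.

```python
import pandas as pd

sales = pd.DataFrame({
    'Units Sold': [2, 1, 3, 4, 1],
    'Unit Price': [999.99, 499.99, 69.99, 15.99, 89.99],
})

summary = sales.describe()

mean_units = summary.loc['mean', 'Units Sold']   # one cell of the summary table
median_units = sales['Units Sold'].median()      # the same value as the 50% row
max_price = sales['Unit Price'].max()

print(list(summary.index))
print(mean_units, median_units, max_price)
```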
