<a href="https://colab.research.google.com/github/pandiarajan-src/IKJourney/blob/main/concepts/pandas_demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pandas

## What is Pandas? (comes from Panel Data)

- Pandas is a powerful Python library widely used for data manipulation and analysis.
- Pandas offers data structures and functions that simplify working with structured data, making it an essential tool in data science and analysis.

## Why use Pandas

- Using Pandas offers several benefits for data manipulation and data analysis.
  - Efficient Data Handling
  - Data Alignment
  - Handling missing data
  - Data Integration
  - Flexible Data Transformation
  - Integration with other libraries








In [None]:
! pip install pandas

In [2]:
import pandas as pd

## Data Structures in Pandas

- A Pandas `Series` is a 1-dimensional array with axis labels.
- A Pandas `DataFrame` is a 2-dimensional data structure with labeled rows and columns. (Each column is a `Series`; several `Series` sharing one index form a `DataFrame`.)
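
The relationship between the two structures can be sketched on a tiny hand-made example. The values and the index labels `a`, `b`, `c` are made up for illustration and are not taken from the iris data.

```python
import pandas as pd

# A Series: one-dimensional values with an index (the axis labels).
petal_length = pd.Series([1.4, 4.7, 6.0], index=["a", "b", "c"], name="petal_length")

# A DataFrame: several Series that share the same index, one per column.
flowers = pd.DataFrame({
    "petal_length": petal_length,
    "species": pd.Series(["setosa", "versicolor", "virginica"], index=["a", "b", "c"]),
})

# Selecting a single column gives back a Series.
column = flowers["petal_length"]
print(type(column).__name__)
print(flowers.shape)
```

Selecting one column with `flowers["petal_length"]` returns a `Series`, while the full object keeps both row and column labels.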

In [3]:
# Load/read data from a CSV file
# pandas has read_* functions for many formats, e.g. read_csv, read_json, read_sql, read_xml, read_pickle
df = pd.read_csv('https://gist.githubusercontent.com/curran/a08a1080b88344b0c8a7/raw/0e7a9b0a5d22642a06d3d5b9bcbad9890c8ee534/iris.csv')

In [4]:
# Show the first 5 rows by default
df.head()

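|   | sepal_length | sepal_width | petal_length | petal_width | species |
| --- | --- | --- | --- | --- | --- |
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |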


In [None]:
# Show the first 10 rows
df.head(10)

In [None]:
# Show the last 5 rows by default
df.tail()

In [None]:
# To work with a local file, first download the CSV from the internet.
# This is a shell command (run with the leading `!`), not pandas.
! wget https://gist.githubusercontent.com/curran/a08a1080b88344b0c8a7/raw/0e7a9b0a5d22642a06d3d5b9bcbad9890c8ee534/iris.csv

In [None]:
# Load the CSV from the local file downloaded above
df = pd.read_csv('./iris.csv')
df.head()

## Pandas exploratory methods

1. `info()`
2. `describe()`

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


In [6]:
df.describe()

|   | sepal_length | sepal_width | petal_length | petal_width |
| --- | --- | --- | --- | --- |
| count | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 5.843333 | 3.054 | 3.758667 | 1.198667 |
| std | 0.828066 | 0.433594 | 1.76442 | 0.763161 |
| min | 4.3 | 2.0 | 1.0 | 0.1 |
| 25% | 5.1 | 2.8 | 1.6 | 0.3 |
| 50% | 5.8 | 3.0 | 4.35 | 1.3 |
| 75% | 6.4 | 3.3 | 5.1 | 1.8 |
| max | 7.9 | 4.4 | 6.9 | 2.5 |


## Filtering a Data Frame - Why do we need filtering?

- When we need to remove redundant or unnecessary data for some tasks.
- When we need to find customers eligible for a promotion.
- When we need to filter out rows or columns that have missing values.
- When we want to filter customers based on the amount spent.

### Filtering a Data Frame
- Filtering with the `loc` and `iloc` indexers
- Filtering by selecting a subset of columns
- Filtering by condition(s)
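
The second and third approaches can be sketched on a small frame. The three rows below are hypothetical and only reuse the iris column names for illustration.

```python
import pandas as pd

# Small hypothetical frame with the same column names as the iris data.
flowers = pd.DataFrame({
    "sepal_length": [5.1, 6.7, 6.0],
    "petal_length": [1.4, 5.0, 5.1],
    "species": ["setosa", "versicolor", "versicolor"],
})

# Passing a list of column names returns a DataFrame with only those columns.
subset = flowers[["species", "petal_length"]]

# A boolean condition keeps only the rows where it is True.
long_petals = flowers[flowers["petal_length"] >= 5.0]

print(subset.columns.tolist())
print(len(long_petals))
```

Note the double brackets in `flowers[["species", "petal_length"]]`: the inner list is the set of columns to keep.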


#### Filtering a Data Frame - `loc` and `iloc` indexers
- `loc` selects by row and column labels.
- `iloc` selects by row and column integer positions.
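
The difference is easiest to see when the row labels are not integers. The sketch below uses a made-up frame with the hypothetical labels `x`, `y`, `z`.

```python
import pandas as pd

# Hypothetical frame with string row labels so labels and positions differ.
flowers = pd.DataFrame(
    {"sepal_length": [5.1, 6.7, 6.0], "species": ["setosa", "versicolor", "virginica"]},
    index=["x", "y", "z"],
)

# loc: address a cell by row label and column label.
by_label = flowers.loc["y", "species"]

# iloc: address the same cell by row position and column position.
by_position = flowers.iloc[1, 1]

# Label slices include the end label; position slices exclude the end position.
label_slice = flowers.loc["x":"y"]
position_slice = flowers.iloc[0:1]

print(by_label, by_position, len(label_slice), len(position_slice))
```

One detail worth remembering: `loc` slices are inclusive of the end label, while `iloc` slices follow the usual Python convention and exclude the end position.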

In [9]:
# Let's select a single column, which returns a Series
df["species"]

0         setosa
1         setosa
2         setosa
3         setosa
4         setosa
         ...    
145    virginica
146    virginica
147    virginica
148    virginica
149    virginica
Name: species, Length: 150, dtype: object

In [25]:
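# Let's build a first, very simple filter: a boolean mask with one value per row
df["species"] == 'versicolor'
# Combine conditions with & (and); only the last expression in a cell is displayed
(df["species"] == 'versicolor') & (df["petal_length"] >= 5.0)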

0      False
1      False
2      False
3      False
4      False
       ...  
145    False
146    False
147    False
148    False
149    False
Length: 150, dtype: bool

In [26]:
# Apply the boolean mask to keep only the matching rows
# df[df["species"] == 'versicolor']
df[(df["species"] == 'versicolor') & (df["petal_length"] >= 5.0)]

|   | sepal_length | sepal_width | petal_length | petal_width | species |
| --- | --- | --- | --- | --- | --- |
| 77 | 6.7 | 3.0 | 5.0 | 1.7 | versicolor |
| 83 | 6.0 | 2.7 | 5.1 | 1.6 | versicolor |


## Handling missing values in Pandas


In [27]:
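# For a Series, you can fill null values with fillna()
# (info() above shows no missing values in the iris data, so this leaves the column unchanged)
df['petal_width'] = df['petal_width'].fillna(1.2)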

In [28]:
# fillna() also works on a whole DataFrame
df = df.fillna(1.2)
df.head()

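|   | sepal_length | sepal_width | petal_length | petal_width | species |
| --- | --- | --- | --- | --- | --- |
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |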


In [31]:
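# Masking with a boolean DataFrame keeps the shape and sets masked cells to NaN;
# it does not drop any rows
df[~df.isna()]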

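|   | sepal_length | sepal_width | petal_length | petal_width | species |
| --- | --- | --- | --- | --- | --- |
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | virginica |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | virginica |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | virginica |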


In [32]:
# Drop every row that contains at least one missing value
df.dropna()

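|   | sepal_length | sepal_width | petal_length | petal_width | species |
| --- | --- | --- | --- | --- | --- |
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | virginica |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | virginica |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | virginica |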
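
Because the iris data has no missing values, the calls above leave it unchanged. A minimal sketch on a made-up frame that does contain gaps may show the effect more clearly; the values and the fill value `1.2` are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing number and one missing string.
flowers = pd.DataFrame({
    "petal_width": [0.2, np.nan, 2.3],
    "species": ["setosa", "versicolor", None],
})

# Count missing values per column.
missing_per_column = flowers.isna().sum()

# Fill only the petal_width column; other columns are left untouched.
filled = flowers.fillna({"petal_width": 1.2})

# Drop every row that has at least one missing value.
complete_rows = flowers.dropna()

# Keep rows where one specific column is present.
rows_with_width = flowers[flowers["petal_width"].notna()]

print(missing_per_column)
print(len(complete_rows), len(rows_with_width))
```

Filling per column with a dictionary avoids writing a numeric default into a text column, which is what a blanket `fillna(1.2)` on the whole frame would do.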


## Sorting and others


In [34]:
# Sort rows by sepal_width, largest values first
df.sort_values('sepal_width', ascending=False)

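|   | sepal_length | sepal_width | petal_length | petal_width | species |
| --- | --- | --- | --- | --- | --- |
| 15 | 5.7 | 4.4 | 1.5 | 0.4 | setosa |
| 33 | 5.5 | 4.2 | 1.4 | 0.2 | setosa |
| 32 | 5.2 | 4.1 | 1.5 | 0.1 | setosa |
| 14 | 5.8 | 4.0 | 1.2 | 0.2 | setosa |
| 16 | 5.4 | 3.9 | 1.3 | 0.4 | setosa |
| ... | ... | ... | ... | ... | ... |
| 87 | 6.3 | 2.3 | 4.4 | 1.3 | versicolor |
| 62 | 6.0 | 2.2 | 4.0 | 1.0 | versicolor |
| 68 | 6.2 | 2.2 | 4.5 | 1.5 | versicolor |
| 119 | 6.0 | 2.2 | 5.0 | 1.5 | virginica |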


In [None]:
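# Topics covered next:
# - how to create a new column
# - how to replace values
# - how to drop a column
# - how to use apply(), which runs a custom Python function on each value
#   (use it sparingly: it is usually slower than built-in vectorized operations)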

In [38]:
df['big_flower'] = (df['sepal_length'] > 5) & (df['petal_length'] > 3)

In [39]:
df['big_flower'].value_counts()

big_flower
True     95
False    55
Name: count, dtype: int64

In [42]:
df.replace('setosa', 'setosa-flower', inplace=True)

In [45]:
df.head()

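|   | sepal_length | sepal_width | petal_length | petal_width | species | big_flower |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | False |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | False |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | False |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | False |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | False |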


In [44]:
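# Revert the replacement. This cell (In [44]) was executed before the
# df.head() cell above (In [45]), which is why that output shows 'setosa'.
df.replace('setosa-flower', 'setosa', inplace=True)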

In [46]:
def big_flower_str(x):
  # Test truthiness instead of `x is False`, which only matches the Python bool singleton
  if not x:
    return "Small Flower"
  else:
    return "Big Flower"

In [47]:
df['size_of_flower'] = df['big_flower'].apply(big_flower_str)
df.head()

|   | sepal_length | sepal_width | petal_length | petal_width | species | big_flower | size_of_flower |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | False | Small Flower |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | False | Small Flower |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | False | Small Flower |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | False | Small Flower |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | False | Small Flower |


In [48]:
df['size_of_flower'].value_counts()

size_of_flower
Big Flower      95
Small Flower    55
Name: count, dtype: int64
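
For a simple two-way label like this, a vectorized alternative to `apply` is possible. The sketch below uses `numpy.where` on a made-up three-row frame; it is one option among several, not necessarily the approach the notebook intends.

```python
import numpy as np
import pandas as pd

# Hypothetical rows reusing two of the iris column names.
flowers = pd.DataFrame({
    "sepal_length": [5.1, 6.7, 4.9],
    "petal_length": [1.4, 5.0, 4.5],
})

# Same rule as the big_flower column: both measurements must exceed the thresholds.
big = (flowers["sepal_length"] > 5) & (flowers["petal_length"] > 3)

# np.where picks the first label where the mask is True and the second elsewhere,
# without calling a Python function once per row.
flowers["size_of_flower"] = np.where(big, "Big Flower", "Small Flower")

print(flowers["size_of_flower"].tolist())
```

On large frames this tends to be faster than `apply`, because the work happens in one array operation.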

In [None]:
# axis=0 refers to rows (the index)
# axis=1 refers to columns, so this drops the 'big_flower' column
df.drop('big_flower', axis=1, inplace=True)
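
The meaning of `axis` can be checked on a small frame. The column names `a` and `b` below are placeholders chosen for illustration.

```python
import pandas as pd

# Tiny placeholder frame: two rows, two columns.
flowers = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# axis=0 works down the rows, giving one result per column.
col_sums = flowers.sum(axis=0)

# axis=1 works across the columns, giving one result per row.
row_sums = flowers.sum(axis=1)

# For drop, the columns= keyword is an equivalent and more explicit
# spelling of drop("b", axis=1).
without_b = flowers.drop(columns="b")

print(col_sums.tolist(), row_sums.tolist(), without_b.columns.tolist())
```

Using `columns=` (or `index=`) with `drop` avoids having to remember which number means which axis.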

## Data Analysis with Pandas

- Data Analysis: The process of inferring insights from data

In [53]:
# Let's start with a question

# Each species of flower is a little different from the others,
# and I need a way to get metrics for each species separately

# groupby + agg performs this per-group analysis in one statement

# Mean - the average of a column
# Median - the middle value of the data once sorted in ascending order

df.groupby('species').agg({
    'sepal_length': 'mean',
    'sepal_width': 'mean',
    'petal_length': 'median',
    'petal_width': 'median',
}).reset_index()

|   | species | sepal_length | sepal_width | petal_length | petal_width |
| --- | --- | --- | --- | --- | --- |
| 0 | setosa | 5.006 | 3.418 | 1.5 | 0.2 |
| 1 | versicolor | 5.936 | 2.77 | 4.35 | 1.3 |
| 2 | virginica | 6.588 | 2.974 | 5.55 | 2.0 |


In [57]:
df.groupby(['species', 'size_of_flower']).agg({
    'sepal_length': ['mean', 'median', 'count'],
    'sepal_width': 'mean',
    'petal_length': 'median',
    'petal_width': 'median',
})

| species | size_of_flower | sepal_length (mean) | sepal_length (median) | sepal_length (count) | sepal_width (mean) | petal_length (median) | petal_width (median) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| setosa | Small Flower | 5.006 | 5.0 | 50 | 3.418 | 1.5 | 0.2 |
| versicolor | Big Flower | 6.017391 | 6.0 | 46 | 2.81087 | 4.4 | 1.3 |
| versicolor | Small Flower | 5.0 | 5.0 | 4 | 2.3 | 3.3 | 1.0 |
| virginica | Big Flower | 6.622449 | 6.5 | 49 | 2.983673 | 5.6 | 2.0 |
| virginica | Small Flower | 4.9 | 4.9 | 1 | 2.5 | 4.5 | 1.7 |
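
Passing a list of functions for one column produces two-level column headers, as in the output above. If flat column names are preferred, named aggregation is one way to get them. The sketch below uses a made-up four-row frame, and the output column names are arbitrary choices for illustration.

```python
import pandas as pd

# Hypothetical data reusing iris column names.
flowers = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica", "virginica"],
    "sepal_length": [5.0, 5.2, 6.4, 6.6],
    "petal_length": [1.4, 1.6, 5.5, 5.7],
})

# Named aggregation: output_name=(input_column, function).
summary = (
    flowers.groupby("species")
    .agg(
        sepal_length_mean=("sepal_length", "mean"),
        petal_length_median=("petal_length", "median"),
        n_rows=("sepal_length", "count"),
    )
    .reset_index()  # turn the group key back into a regular column
)

print(summary)
```

Each keyword becomes one flat output column, so the result can be filtered or merged without handling a column MultiIndex.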


In [None]:
# Syntax for concat and merge

# pd.concat([df1, df2], axis=0)   # the frames must be passed as a list

# pd.merge(df1, df2, on='join_column', how='inner')
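
A runnable version of the two calls, as a minimal sketch: `df1`, `df2`, `join_column`, `left_value` and `right_value` are placeholder names with made-up values.

```python
import pandas as pd

# Two small placeholder frames that share a key column.
df1 = pd.DataFrame({"join_column": [1, 2], "left_value": ["a", "b"]})
df2 = pd.DataFrame({"join_column": [2, 3], "right_value": ["x", "y"]})

# concat stacks frames; axis=0 appends rows, and columns missing
# from one frame are filled with NaN.
stacked = pd.concat([df1, df2], axis=0, ignore_index=True)

# merge joins on a key; how="inner" keeps only keys present in both frames.
merged = pd.merge(df1, df2, on="join_column", how="inner")

print(stacked.shape)
print(merged)
```

Here only the key `2` appears in both frames, so the inner merge returns a single row, while the concatenation keeps all four rows.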
