<a href="https://colab.research.google.com/github/pandiarajan-src/IKJourney/blob/main/concepts/Pandas_demo0.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pandas

## What is Pandas? (comes from Panel Data)

- Pandas is a powerful Python library widely used for data manipulation and analysis.
- Pandas offers data structures and functions that simplify working with structured data, making it an essential tool in data science and analysis.

## Why use Pandas

- Using Pandas offers several benefits for data manipulation and data analysis.
  - Efficient Data Handling
  - Data Alignment
  - Handling missing data
  - Data Integration
  - Flexible Data Transformation
  - Integration with other libraries








In [None]:
! pip install pandas

In [None]:
import pandas as pd

## Data Structures in Pandas

- A Pandas `Series` is a one-dimensional array with axis labels.
- A Pandas `DataFrame` is a two-dimensional data structure with labeled rows and columns. (One or more Series sharing the same index form a DataFrame.)
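
The relationship between the two structures can be sketched with a tiny hand-made example. The values and the index labels `a`, `b`, `c` below are made up for illustration and are not taken from the iris file.

```python
import pandas as pd

# A Series: one-dimensional values plus an index of labels (illustrative data).
petal_length = pd.Series([1.4, 4.7, 6.0], index=["a", "b", "c"], name="petal_length")

# A DataFrame: several Series that share the same index, one per column.
flowers = pd.DataFrame({
    "petal_length": petal_length,
    "species": pd.Series(["setosa", "versicolor", "virginica"], index=["a", "b", "c"]),
})

# Selecting a single column gives back a Series.
column = flowers["petal_length"]
print(type(column).__name__)
print(flowers.shape)
```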

In [None]:
# Load/Read data from a CSV file
# read_*** can be any of read_sql, read_xml, read_pickle, read_json, read_csv, etc.
df = pd.read_csv('https://gist.githubusercontent.com/curran/a08a1080b88344b0c8a7/raw/0e7a9b0a5d22642a06d3d5b9bcbad9890c8ee534/iris.csv')

In [None]:
# Show the first 5 rows by default
df.head()

In [None]:
# Show the first 10 rows
df.head(10)

In [None]:
# Show the last 5 rows by default
df.tail()

In [None]:
# To work with a local file, first download the CSV from the internet
# This is not Pandas, just a basic Linux command to download the CSV
! wget https://gist.githubusercontent.com/curran/a08a1080b88344b0c8a7/raw/0e7a9b0a5d22642a06d3d5b9bcbad9890c8ee534/iris.csv

In [None]:
# Load the CSV file from the local file system
df = pd.read_csv('./iris.csv')
df.head()

## Pandas exploratory methods

1. `info()`
2. `describe()`

In [None]:
df.info()

In [None]:
df.describe()

## Filtering a Data Frame - Why do we need filtering?

- When we need to remove redundant or unnecessary data for a task.
- When we need to find customers eligible for a promotion.
- When we need to filter out rows or columns that have missing values.
- When we want to filter customers based on the amount spent.

### Filtering a Data Frame
- Filtering with the `loc` and `iloc` indexers
- Filtering by selecting a subset of columns
- Filtering by condition(s)
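
Selecting a subset of columns can be sketched as follows. The small frame is made up for illustration; the column names mirror the iris data used in this notebook.

```python
import pandas as pd

# Illustrative data, not read from the iris file.
flowers = pd.DataFrame({
    "sepal_length": [5.1, 7.0, 6.3],
    "petal_length": [1.4, 4.7, 6.0],
    "species": ["setosa", "versicolor", "virginica"],
})

# A list of column names inside [] returns a DataFrame with only those columns.
subset = flowers[["species", "petal_length"]]

# A single name (no inner list) returns a Series instead.
species = flowers["species"]
```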


#### Filtering a Data Frame - `loc` and `iloc` methods
- `loc` selects by row and column labels.
- `iloc` selects by integer row and column positions.
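
A minimal sketch of the difference, assuming a small frame with string row labels (`a`, `b`, `c` are made up so that labels and positions are easy to tell apart):

```python
import pandas as pd

# Illustrative data with explicit row labels.
flowers = pd.DataFrame(
    {"petal_length": [1.4, 4.7, 6.0], "species": ["setosa", "versicolor", "virginica"]},
    index=["a", "b", "c"],
)

by_label = flowers.loc["b", "species"]   # row label "b", column label "species"
by_position = flowers.iloc[1, 1]         # second row, second column

# Label slices include the end label; position slices exclude the end position.
label_slice = flowers.loc["a":"b"]       # rows "a" and "b"
position_slice = flowers.iloc[0:1]       # only the row at position 0
```

With the default integer index of the iris DataFrame, labels and positions happen to coincide at first, but they diverge once rows are filtered or sorted.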

In [None]:
# Let's select a single column, which is returned as a Series
df["species"]

In [None]:
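# Build a boolean mask: True where the species is versicolor
df["species"] == 'versicolor'
# Combine conditions with & (the notebook displays only this last expression)
(df["species"] == 'versicolor') & (df["petal_length"] >= 5.0)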

In [None]:
# Use the boolean mask to keep only the matching rows
# df[df["species"] == 'versicolor']
df[(df["species"] == 'versicolor') & (df["petal_length"] >= 5.0)]

## Handling missing values in Pandas


In [None]:
# For a series, you can fill null values with
df['petal_width'] = df['petal_width'].fillna(1.2)

In [None]:
df = df.fillna(1.2)
df.head()

In [None]:
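# Note: a DataFrame-shaped boolean mask does not drop rows; the result keeps
# the same shape, with NaN wherever the mask is False (here, the missing cells)
df[~df.isna()]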

In [None]:
df.dropna()
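
These calls only have a visible effect when the data actually contains missing values. A minimal sketch on a made-up frame with explicit `NaN` entries (the numbers are illustrative, and filling with the column mean is just one possible choice):

```python
import numpy as np
import pandas as pd

# Illustrative data with one missing value per column.
measurements = pd.DataFrame({
    "petal_length": [1.4, np.nan, 6.0],
    "petal_width": [0.2, 1.3, np.nan],
})

missing_per_column = measurements.isna().sum()      # number of NaN in each column
filled = measurements.fillna(measurements.mean())   # fill each column with its own mean
complete_rows = measurements.dropna()               # keep only rows without any NaN
```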

## Sorting and others


In [None]:
df.sort_values('sepal_width', ascending=False)

In [None]:
# How to create a new column
# How to replace a value
# How to drop a column
# How to use apply -> runs a custom Python function on each value; use it sparingly,
# because it is usually slower than Pandas' built-in vectorized operations

In [None]:
df['big_flower'] = (df['sepal_length'] > 5) & (df['petal_length'] > 3)

In [None]:
df['big_flower'].value_counts()

In [None]:
df.replace('setosa', 'setosa-flower', inplace=True)

In [None]:
df.head()

In [None]:
df.replace('setosa-flower', 'setosa', inplace=True)

In [None]:
def big_flower_str(x):
  # Use truthiness rather than `is False`, which fails for NumPy booleans
  if not x:
    return "Small Flower"
  else:
    return "Big Flower"

In [None]:
df['size_of_flower'] = df['big_flower'].apply(big_flower_str)
df.head()
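
For a simple two-way label like this one, the same column can also be built without `apply`. The sketch below uses `numpy.where`, which is not part of the original notebook, on a small made-up frame:

```python
import numpy as np
import pandas as pd

# Illustrative data standing in for the 'big_flower' column.
flowers = pd.DataFrame({"big_flower": [True, False, True]})

# np.where picks the first label where the condition is True, the second otherwise.
flowers["size_of_flower"] = np.where(flowers["big_flower"], "Big Flower", "Small Flower")
```

This works on the whole column at once instead of calling a Python function per row, which tends to matter on larger data.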

In [None]:
df['size_of_flower'].value_counts()

In [None]:
# axis=0 refers to rows (the index)
# axis=1 refers to columns, so this drops the 'big_flower' column
df.drop('big_flower', axis=1, inplace=True)

## Data Analysis with Pandas

- Data Analysis: The process of inferring insights from data

In [None]:
# Let's start with a question

# Each of the species of the flower is a little bit different from each other
# and I need a way to get metrics for each of the species differently

# Performed level 2 analysis with one line

# Mean - Average of a column
# Median - the middle value of the data sorted in ascending order

df.groupby('species').agg({
    'sepal_length': 'mean',
    'sepal_width': 'mean',
    'petal_length': 'median',
    'petal_width': 'median',
}).reset_index()

In [None]:
df.groupby(['species', 'size_of_flower']).agg({
    'sepal_length': ['mean', 'median', 'count'],
    'sepal_width': 'mean',
    'petal_length': 'median',
    'petal_width': 'median',
})
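
Passing a list of functions, as above, produces two-level column names. One way to get flat, readable column names instead is named aggregation, sketched here on a small made-up frame (the output names such as `sepal_length_mean` are chosen for illustration):

```python
import pandas as pd

# Illustrative data, not read from the iris file.
flowers = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica", "virginica"],
    "sepal_length": [5.0, 5.2, 6.4, 6.6],
    "petal_length": [1.4, 1.6, 5.5, 5.9],
})

# Each keyword is an output column name; the tuple is (input column, function).
summary = flowers.groupby("species").agg(
    sepal_length_mean=("sepal_length", "mean"),
    petal_length_median=("petal_length", "median"),
    n_rows=("sepal_length", "count"),
).reset_index()
```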

In [None]:
# Syntax for concat and merge

# pd.concat([df1, df2], axis=0)  # the frames must be passed as a list

# pd.merge(df1, df2, on='join_column', how='inner')
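
A runnable sketch of both calls, assuming three tiny made-up frames (`df1`, `df2`, and a hypothetical `colors` lookup table whose values are invented for illustration):

```python
import pandas as pd

# Illustrative frames with the same columns.
df1 = pd.DataFrame({"species": ["setosa", "versicolor"], "petal_length": [1.4, 4.7]})
df2 = pd.DataFrame({"species": ["virginica"], "petal_length": [6.0]})

# concat takes a list of frames; axis=0 stacks them row-wise.
stacked = pd.concat([df1, df2], axis=0, ignore_index=True)

# Hypothetical lookup table sharing the 'species' column.
colors = pd.DataFrame({"species": ["setosa", "virginica"], "color": ["white", "purple"]})

# An inner merge keeps only the species present in both frames.
merged = pd.merge(stacked, colors, on="species", how="inner")
```

Here `versicolor` is dropped from `merged` because it has no match in `colors`; `how='left'` would keep it with a missing `color`.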
