## Pandas

- Pandas is an open-source Python library that provides data structures and data analysis tools.
- It is used for working with datasets.
- The name comes from "Panel Data" or "Python Data Analysis".


### Why pandas in Data Science?

- Tabular storage: Store data in an easy-to-use table format
- Data Loading: Read data from files such as CSV, XML, and JSON, including ZIP-compressed files
- Data Cleaning: Remove duplicates, fill missing values, and fix errors
- Data Exploration: Summarize data with statistics and visual checks
- Data Transformation: Filter and select specific rows or columns, sort data by any column, and group data to calculate sums, averages, counts, etc.
- Feature Engineering: Create new columns or features from existing data
- Data Integration: Merge or join multiple datasets
- Time Series Analysis: Work with dates and times effectively
- Preparing Data for Machine Learning: Format and structure data for ML models
- Quick Prototyping: Test ideas fast with easy data manipulation
- Exporting Results: Save cleaned and processed data for reports or further analysis
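A few of the tasks listed above can be sketched on a small table. The rows reuse the sample data shown in this notebook; the variable names (`high_scorers`, `avg_by_city`), the `Passed` column, and the pass mark of 70 are illustrative assumptions, not part of the original dataset.

```python
import pandas as pd

# Sample rows matching the data.csv example used in this notebook
df = pd.DataFrame({
    "Name": ["Aarav", "Sita", "Krishna", "Bikash", "Rina", "Kamal"],
    "Age": [20, 22, 23, 21, 23, 20],
    "City": ["Pokhara", "Kathmandu", "Pokhara", "Biratnagar", "Butwal", "Pokhara"],
    "Marks": [78, 85, 99, 67, 74, 90],
})

# Data transformation: keep rows with Marks above 80, sorted highest first
high_scorers = df[df["Marks"] > 80].sort_values("Marks", ascending=False)

# Grouping: average Marks per City
avg_by_city = df.groupby("City")["Marks"].mean()

# Feature engineering: derive a new boolean column (pass mark of 70 is assumed)
df["Passed"] = df["Marks"] >= 70

print(high_scorers)
print(avg_by_city)
print(df)
```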



#### Install pandas library

- Open a terminal (in VS Code, press Ctrl + Shift + ` to open one).
- Run `pip install pandas`.

In [2]:
import pandas as pd

Pandas is usually imported under the alias `pd`.
(In Python, an alias is an alternate name that refers to the same thing.)
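As a quick check of what an alias means, the sketch below imports the library under both its full name and the `pd` alias and confirms that both names point to the same module object.

```python
import pandas            # full module name
import pandas as pd      # same module, bound to the shorter name pd

# Both names refer to the same module object, so either can be used
print(pd is pandas)

# The installed version is available through either name
print(pd.__version__)
```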

#### Read a CSV file

In [3]:
# Read the csv file into a dataframe
df = pd.read_csv("data.csv")

# Set option to display maximum of 100 rows
pd.set_option('display.max_rows', 100)

# Set option to display maximum 10 columns
pd.set_option('display.max_columns', 10)

# Display resulting DataFrame
df

,Name,Age,City,Marks
0,Aarav,20,Pokhara,78
1,Sita,22,Kathmandu,85
2,Krishna,23,Pokhara,99
3,Bikash,21,Biratnagar,67
4,Rina,23,Butwal,74
5,Kamal,20,Pokhara,90
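Since `data.csv` itself is not included here, the sketch below reads the same rows from an in-memory string instead of a file. The CSV text is an assumption about the file's contents, reconstructed from the output above; `nrows` is a standard `read_csv` parameter that limits how many data rows are read.

```python
import io
import pandas as pd

# Assumed contents of data.csv, reconstructed from the displayed output
csv_text = """Name,Age,City,Marks
Aarav,20,Pokhara,78
Sita,22,Kathmandu,85
Krishna,23,Pokhara,99
Bikash,21,Biratnagar,67
Rina,23,Butwal,74
Kamal,20,Pokhara,90
"""

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

# Read only the first three data rows
first_three = pd.read_csv(io.StringIO(csv_text), nrows=3)

print(df)
print(first_three)
```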


### DataFrame

In pandas, data is stored in a DataFrame: a two-dimensional structure made up of rows and columns.
- A Series is one-dimensional; multiple Series together form a DataFrame.
- Every column of a DataFrame is a Series.
- A DataFrame variable is conventionally named `df`.
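The relationship between Series and DataFrame can be sketched as follows. The two Series (`names`, `marks`) are small illustrative examples; `pd.concat` with `axis=1` is one of several ways to combine them into columns.

```python
import pandas as pd

# Two one-dimensional Series; the name of each becomes its column label
names = pd.Series(["Aarav", "Sita", "Krishna"], name="Name")
marks = pd.Series([78, 85, 99], name="Marks")

# Placing the Series side by side (axis=1) forms a two-dimensional DataFrame
df = pd.concat([names, marks], axis=1)

# Selecting a single column gives back a Series
col = df["Marks"]

print(type(df))
print(type(col))
print(df.ndim, col.ndim)
```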

##### Accessing a defined number of rows


In [4]:
# df.head(n) returns the first n rows; by default, n is 5
df.head()

,Name,Age,City,Marks
0,Aarav,20,Pokhara,78
1,Sita,22,Kathmandu,85
2,Krishna,23,Pokhara,99
3,Bikash,21,Biratnagar,67
4,Rina,23,Butwal,74


In [5]:
# df.tail(n) returns the last n rows; by default, n is 5
df.tail()

,Name,Age,City,Marks
1,Sita,22,Kathmandu,85
2,Krishna,23,Pokhara,99
3,Bikash,21,Biratnagar,67
4,Rina,23,Butwal,74
5,Kamal,20,Pokhara,90
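Both methods accept the number of rows as an argument. A minimal sketch, using a two-column subset of the sample data (an assumption made to keep the example short):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Aarav", "Sita", "Krishna", "Bikash", "Rina", "Kamal"],
    "Marks": [78, 85, 99, 67, 74, 90],
})

# First two rows
first_two = df.head(2)

# Last three rows; the original index labels (3, 4, 5) are preserved
last_three = df.tail(3)

print(first_two)
print(last_three)
```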


In [6]:
# Get the number of rows and columns as a (rows, columns) tuple
df.shape

(6, 4)

In [7]:
# Get the column names only
df.columns

Index(['Name', 'Age', 'City', 'Marks'], dtype='object')

In [8]:
# Generate descriptive statistics for numeric columns: count, mean, standard deviation, minimum, 25th percentile, 50th percentile (median), 75th percentile, and maximum
df.describe()

,Age,Marks
count,6.0,6.0
mean,21.5,82.166667
std,1.378405,11.548449
min,20.0,67.0
25%,20.25,75.0
50%,21.5,81.5
75%,22.75,88.75
max,23.0,99.0
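By default, `describe()` summarizes only the numeric columns when a DataFrame has any, which is why `Name` and `City` are absent from the output above. The sketch below rebuilds the sample data (assumed to match `data.csv`), reads individual statistics out of the summary, and uses `include="all"` to cover the text columns as well.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Aarav", "Sita", "Krishna", "Bikash", "Rina", "Kamal"],
    "Age": [20, 22, 23, 21, 23, 20],
    "City": ["Pokhara", "Kathmandu", "Pokhara", "Biratnagar", "Butwal", "Pokhara"],
    "Marks": [78, 85, 99, 67, 74, 90],
})

# Default summary: numeric columns only
summary = df.describe()

# Individual statistics can be read by row label and column name
mean_marks = summary.loc["mean", "Marks"]
max_age = summary.loc["max", "Age"]

# include="all" also summarizes non-numeric columns (count, unique, top, freq)
full_summary = df.describe(include="all")

print(summary)
print(full_summary)
```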


In [9]:
# Summarize the DataFrame: column names, non-null counts, and data types
df.info()

# dtypes returns only the data type of each column (int64, float64, object, etc.)
print("\n Using dtypes")
df.dtypes

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    6 non-null      object
 1   Age     6 non-null      int64 
 2   City    6 non-null      object
 3   Marks   6 non-null      int64 
dtypes: int64(2), object(2)
memory usage: 324.0+ bytes

 Using dtypes


Name     object
Age       int64
City     object
Marks     int64
dtype: object
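Once the data types are known, they can be used to pick out columns or be changed. A minimal sketch, assuming a three-column subset of the sample data; `select_dtypes` and `astype` are standard pandas methods, and the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Aarav", "Sita", "Krishna"],
    "Age": [20, 22, 23],
    "Marks": [78, 85, 99],
})

# Keep only the numeric columns
numeric = df.select_dtypes(include="number")

# Convert a single column to another type (integer to float here)
age_as_float = df["Age"].astype("float64")

print(numeric.dtypes)
print(age_as_float.dtype)
```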

In [10]:
# Return a boolean DataFrame: True where a value is missing (NaN or None)
df.isna()

,Name,Age,City,Marks
0,False,False,False,False
1,False,False,False,False
2,False,False,False,False
3,False,False,False,False
4,False,False,False,False
5,False,False,False,False


In [11]:
# isnull() is an alias of isna() and returns the same result
df.isnull()

,Name,Age,City,Marks
0,False,False,False,False
1,False,False,False,False
2,False,False,False,False
3,False,False,False,False
4,False,False,False,False
5,False,False,False,False


In [12]:
# Return a boolean DataFrame: True where a value is present (the inverse of isna())
df.notna()

,Name,Age,City,Marks
0,True,True,True,True
1,True,True,True,True
2,True,True,True,True
3,True,True,True,True
4,True,True,True,True
5,True,True,True,True


In [13]:
# notnull() is an alias of notna() and returns the same result
df.notnull()

,Name,Age,City,Marks
0,True,True,True,True
1,True,True,True,True
2,True,True,True,True
3,True,True,True,True
4,True,True,True,True
5,True,True,True,True
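The sample data has no missing values, so every check above returns the same value in each cell. The sketch below uses a small, made-up DataFrame with two gaps (an assumption for illustration) to show what these methods report and how to count missing values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing Age and one missing City
df = pd.DataFrame({
    "Name": ["Aarav", "Sita", "Krishna"],
    "Age": [20, np.nan, 23],
    "City": ["Pokhara", "Kathmandu", None],
})

# Count missing values in each column (True counts as 1 when summed)
missing_per_column = df.isna().sum()

# isnull() is an alias of isna(), so the results are identical
same = df.isna().equals(df.isnull())

# notna() is the element-wise opposite of isna()
opposite = df.notna().equals(~df.isna())

print(df.isna())
print(missing_per_column)
print(same, opposite)
```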
