## Pandas

- Pandas is an open-source Python library that provides data structures and data analysis tools
- It is used for working with tabular datasets
- The name comes from "panel data" and is also read as "Python data analysis"


### Why pandas in Data Science?

- Data Storage: Store data in an easy-to-use table format
- Data Loading: Read data from file formats such as CSV, XML, and JSON, including compressed (e.g. ZIP) files
- Data Cleaning: Remove duplicates, fill missing values, and fix errors
- Data Exploration: Summarize data with statistics and visual checks
- Data Transformation: Filter and select specific rows or columns, sort data by any column, group data and calculate sums, averages, counts, etc.
- Feature Engineering: Create new columns or features from existing data
- Data Integration: Merge or join multiple datasets
- Time Series Analysis: Work with dates and times effectively
- Preparing Data for Machine Learning: Format and structure data for ML models
- Quick Prototyping: Test ideas fast with easy data manipulation
- Exporting Results: Save cleaned and processed data for reports or further analysis
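
Several of the tasks listed above can be sketched in a few lines. The example below is only an illustration: the DataFrames raw and cities, their values, and the Passed and Province columns are hypothetical and are not part of data.csv.

```python
import pandas as pd

# Hypothetical sample data with one duplicate row and one missing mark
raw = pd.DataFrame({
    "Name": ["Aarav", "Sita", "Sita", "Krishna"],
    "City": ["Pokhara", "Kathmandu", "Kathmandu", "Pokhara"],
    "Marks": [78.0, 85.0, 85.0, None],
})

# Data cleaning: drop the duplicate row, then fill the missing mark with the column mean
clean = raw.drop_duplicates().reset_index(drop=True)
clean["Marks"] = clean["Marks"].fillna(clean["Marks"].mean())

# Feature engineering: derive a new column from an existing one
clean["Passed"] = clean["Marks"] >= 80

# Data transformation: group by city and average the marks
avg_by_city = clean.groupby("City")["Marks"].mean()

# Data integration: merge with a second (hypothetical) lookup table
cities = pd.DataFrame({
    "City": ["Pokhara", "Kathmandu"],
    "Province": ["Gandaki", "Bagmati"],
})
merged = clean.merge(cities, on="City", how="left")

print(merged)
print(avg_by_city)
```

Each step returns a new object or assigns a column explicitly, which keeps the original raw table unchanged for comparison.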



#### Install pandas library

- Open a terminal (in an editor such as VS Code, press Ctrl + Shift + `)
- Type: pip install pandas

In [7]:
import pandas as pd

Pandas is usually imported under the alias pd.
[In Python, an alias is an alternate name that refers to the same thing.]

#### Read csv file

In [8]:
# Read the csv file into a dataframe
df = pd.read_csv("data.csv")

# Set option to display maximum of 100 rows
pd.set_option('display.max_rows', 100)

# Set option to display maximum 10 columns
pd.set_option('display.max_columns', 10)

# Display resulting DataFrame
df

Unnamed: 0,Name,Age,City,Marks
0,Aarav,20,Pokhara,78
1,Sita,22,Kathmandu,85
2,Krishna,23,Pokhara,99
3,Bikash,21,Biratnagar,67
4,Rina,23,Butwal,74
5,Kamal,20,Pokhara,90
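
Because data.csv itself is not shown, the following sketch reads a small hypothetical CSV from an in-memory string instead of a file. It assumes only that read_csv accepts a file-like object, and it shows the usecols and nrows parameters as well as writing the data back out with to_csv. The variable names are illustrative.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = """Name,Age,City,Marks
Aarav,20,Pokhara,78
Sita,22,Kathmandu,85
"""

# read_csv accepts a file path or any file-like object
df_demo = pd.read_csv(io.StringIO(csv_text))

# usecols limits which columns are loaded; nrows limits how many rows are read
names_only = pd.read_csv(io.StringIO(csv_text), usecols=["Name", "Marks"], nrows=1)

# Export back to CSV text; index=False leaves out the row index column
out = df_demo.to_csv(index=False)

print(df_demo)
print(names_only)
print(out)
```

Passing index=False when saving avoids an extra unnamed index column appearing when the file is read again.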


### DataFrame

In pandas, data is stored in a DataFrame. A DataFrame is two-dimensional, consisting of rows and columns.
- Multiple Series together form a DataFrame; a Series is one-dimensional
- Every column of a DataFrame is a Series
- A DataFrame variable is conventionally named df
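
The relationship between Series and DataFrame can be checked directly. This is a minimal sketch with made-up values; the names ages, marks, and frame are illustrative only.

```python
import pandas as pd

# Two one-dimensional Series (hypothetical values)
ages = pd.Series([20, 22, 23], name="Age")
marks = pd.Series([78, 85, 99], name="Marks")

# Placing Series side by side (axis=1) forms a two-dimensional DataFrame
frame = pd.concat([ages, marks], axis=1)

# Selecting a single column gives back a Series
age_column = frame["Age"]

print(type(frame).__name__, frame.ndim)            # DataFrame 2
print(type(age_column).__name__, age_column.ndim)  # Series 1
```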

##### Accessing a defined number of rows


In [15]:
# df.head(n) returns the first n rows; by default it returns the first five rows
df.head()

Unnamed: 0,Name,Age,City,Marks
0,Aarav,20,Pokhara,78
1,Sita,22,Kathmandu,85
2,Krishna,23,Pokhara,99
3,Bikash,21,Biratnagar,67
4,Rina,23,Butwal,74


In [16]:
# df.tail(n) returns the last n rows; by default it returns the last five rows
df.tail()

Unnamed: 0,Name,Age,City,Marks
1,Sita,22,Kathmandu,85
2,Krishna,23,Pokhara,99
3,Bikash,21,Biratnagar,67
4,Rina,23,Butwal,74
5,Kamal,20,Pokhara,90


##### Accessing Rows and Columns of a DataFrame

dataframe[x:y:z]
x --> start, y --> end (excluded), z --> step (like string slicing)



In [36]:
# Access rows at positions 2 to 4 (the end position 5 is excluded)
df[2:5]


Unnamed: 0,Name,Age,City,Marks,Remarks
2,Krishna,23,Pokhara,99,Good
3,Bikash,21,Biratnagar,67,Good
4,Rina,23,Butwal,74,Good


In [37]:
# Access every second row from position 0 up to 4 (step works like range)
df[0:5:2]

Unnamed: 0,Name,Age,City,Marks,Remarks
0,Aarav,20,Pokhara,78,Good
2,Krishna,23,Pokhara,99,Good
4,Rina,23,Butwal,74,Good


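#### Implicit/ Integer/ Internal Indexing

iloc = implicit, integer, or internal indexing

- Always uses Python-style 0-based integer positions, regardless of the index labels
- Slices exclude the end position, as in normal Python slicing
- The default index that pandas assigns (0, 1, 2, ...) matches these positions, so it can be used directly with iloc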

In [64]:
# Display the rows at positions 0 to 2 (the end position 3 is excluded)
df.iloc[0:3]

Unnamed: 0,Name,Age,City,Marks,Remarks
0,Aarav,20,Pokhara,78,Good
1,Sita,22,Kathmandu,85,Good
2,Krishna,23,Pokhara,99,Good


#### Explicit/ Name Indexing

loc = explicit, label (name) based indexing

In [65]:
# loc selects by index label, and the end label is also included

df.loc[0:3]

Unnamed: 0,Name,Age,City,Marks,Remarks
0,Aarav,20,Pokhara,78,Good
1,Sita,22,Kathmandu,85,Good
2,Krishna,23,Pokhara,99,Good
3,Bikash,21,Biratnagar,67,Good
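
With the default 0, 1, 2, ... index, iloc and loc look almost identical. The difference is easier to see when the index holds names. The sketch below assumes a small hypothetical table, people, indexed by student name.

```python
import pandas as pd

# Hypothetical table whose index labels are names instead of 0, 1, 2, ...
people = pd.DataFrame(
    {"Age": [20, 22, 23, 21], "Marks": [78, 85, 99, 67]},
    index=["Aarav", "Sita", "Krishna", "Bikash"],
)

# iloc: integer positions, end position excluded -> rows at positions 0 and 1
by_position = people.iloc[0:2]

# loc: index labels, end label included -> Aarav, Sita, and Krishna
by_label = people.loc["Aarav":"Krishna"]

# Both accessors also take a column selector after the comma
marks_by_label = people.loc["Sita", "Marks"]   # row label, column name
marks_by_position = people.iloc[1, 1]          # row position, column position

print(by_position)
print(by_label)
print(marks_by_label, marks_by_position)
```

Here both single-cell lookups refer to the same cell, once by name and once by position.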


In [51]:
# Create a new column with the same value in every row

df["Remarks"] = "Good"
df

Unnamed: 0,Name,Age,City,Marks,Remarks
0,Aarav,20,Pokhara,78,Good
1,Sita,22,Kathmandu,85,Good
2,Krishna,23,Pokhara,99,Good
3,Bikash,21,Biratnagar,67,Good
4,Rina,23,Butwal,74,Good
5,Kamal,20,Pokhara,90,Good


In [52]:
# Get the unique values of a column

df["City"].unique()

array(['Pokhara', 'Kathmandu', 'Biratnagar', 'Butwal'], dtype=object)

In [53]:
# Count the number of unique values in a column

df["City"].nunique()

4

In [54]:
# Count how many times each unique value occurs

df["City"].value_counts()

City
Pokhara       3
Kathmandu     1
Biratnagar    1
Butwal        1
Name: count, dtype: int64

In [55]:
# display the column Age only
df["Age"]

0    20
1    22
2    23
3    21
4    23
5    20
Name: Age, dtype: int64

In [56]:
# know the number of rows and columns
df.shape

(6, 5)

In [57]:
# Get only the column names
df.columns

Index(['Name', 'Age', 'City', 'Marks', 'Remarks'], dtype='object')

In [58]:
# Generate descriptive statistics: count, mean, standard deviation, minimum, maximum, and the 25th, 50th (median), and 75th percentiles
df.describe()

Unnamed: 0,Age,Marks
count,6.0,6.0
mean,21.5,82.166667
std,1.378405,11.548449
min,20.0,67.0
25%,20.25,75.0
50%,21.5,81.5
75%,22.75,88.75
max,23.0,99.0


In [59]:
# Describe only the non-numeric (object) columns
df.describe(include='object')

Unnamed: 0,Name,City,Remarks
count,6,6,6
unique,6,4,1
top,Aarav,Pokhara,Good
freq,1,3,6


Data are of two types:
- Qualitative --> Categorical --> usually stored as strings
- Quantitative --> Numerical

Numerical data can be further divided into two types:
- Continuous: Float --> Any value within a range on the number line
- Discrete: Integer --> Only distinct, countable values such as whole numbers
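
In pandas, these categories roughly correspond to column dtypes, and select_dtypes can split a DataFrame along that line. The following is a sketch with hypothetical data; the Height column and its values are invented to show a continuous (float) column.

```python
import pandas as pd

# Hypothetical data: one text column, one integer column, one float column
sample = pd.DataFrame({
    "Name": ["Aarav", "Sita"],   # qualitative (categorical)
    "Age": [20, 22],             # quantitative, discrete (integer)
    "Height": [1.72, 1.60],      # quantitative, continuous (float)
})

# Keep only the numerical columns
numeric = sample.select_dtypes(include="number")

# Keep everything that is not numerical
categorical = sample.select_dtypes(exclude="number")

print(list(numeric.columns))      # ['Age', 'Height']
print(list(categorical.columns))  # ['Name']
```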

Output for numeric data
- count: number of non-null entries
- mean: average of the column
- std: standard deviation
- min: minimum value of the column
- 25%: 25th percentile (1st quartile)
- 50%: 50th percentile (median)
- 75%: 75th percentile (3rd quartile)
- max: maximum value of the column

Output for non-numeric (Categorical/Object) data
- count: number of non-null entries
- unique: number of unique values
- top: most frequent value
- freq: frequency (count) of the most frequent value
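
Each entry in the describe() output can also be computed with an individual method, which helps when verifying what a row means. The sketch below rebuilds the Marks and City columns from the table shown earlier; the variable names are illustrative.

```python
import pandas as pd

# Marks and City values copied from the table shown earlier
marks = pd.Series([78, 85, 99, 67, 74, 90], name="Marks")
city = pd.Series(
    ["Pokhara", "Kathmandu", "Pokhara", "Biratnagar", "Butwal", "Pokhara"],
    name="City",
)

# Numeric summary computed piece by piece
numeric_summary = {
    "count": marks.count(),         # non-null entries
    "mean": marks.mean(),
    "std": marks.std(),             # sample standard deviation (divides by n - 1)
    "min": marks.min(),
    "25%": marks.quantile(0.25),
    "50%": marks.median(),
    "75%": marks.quantile(0.75),
    "max": marks.max(),
}

# Categorical summary computed piece by piece
counts = city.value_counts()
categorical_summary = {
    "count": city.count(),
    "unique": city.nunique(),
    "top": counts.index[0],         # most frequent value
    "freq": counts.iloc[0],         # how often it occurs
}

print(numeric_summary)
print(categorical_summary)
```

Note that std is the sample standard deviation, so it divides by n - 1 rather than n.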



In [60]:
# describe both numeric and categorical data
df.describe(include='all')

Unnamed: 0,Name,Age,City,Marks,Remarks
count,6,6.0,6,6.0,6
unique,6,,4,,1
top,Aarav,,Pokhara,,Good
freq,1,,3,,6
mean,,21.5,,82.166667,
std,,1.378405,,11.548449,
min,,20.0,,67.0,
25%,,20.25,,75.0,
50%,,21.5,,81.5,
75%,,22.75,,88.75,
max,,23.0,,99.0,


In [61]:
# Treat every column as categorical (object) to get count, unique, top, and freq for the numerical columns as well
df.astype('object').describe()

Unnamed: 0,Name,Age,City,Marks,Remarks
count,6,6,6,6,6
unique,6,4,4,6,1
top,Aarav,20,Pokhara,78,Good
freq,1,2,3,1,6


In [62]:
# Check the data type of each column in the DataFrame (int, float, object, etc.)
df.info()


print("\n Using dtypes")
df.dtypes

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Name     6 non-null      object
 1   Age      6 non-null      int64 
 2   City     6 non-null      object
 3   Marks    6 non-null      int64 
 4   Remarks  6 non-null      object
dtypes: int64(2), object(3)
memory usage: 372.0+ bytes

 Using dtypes


Name       object
Age         int64
City       object
Marks       int64
Remarks    object
dtype: object
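
Once the dtypes are known, a column can be converted with astype. This is a minimal sketch on hypothetical data; the choice of float64 for Marks and category for City is only an example, not a recommendation for every dataset.

```python
import pandas as pd

# Hypothetical two-row table with an integer column and a text column
scores = pd.DataFrame({
    "City": ["Pokhara", "Kathmandu"],
    "Marks": [78, 85],
})

# astype returns a new DataFrame; the original is left unchanged
converted = scores.astype({"Marks": "float64", "City": "category"})

print(scores.dtypes)
print(converted.dtypes)
```

The category dtype stores each distinct value once, which can reduce memory use for columns with many repeated values.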