<a href="https://colab.research.google.com/github/lovnishverma/Python-Getting-Started/blob/main/Pandas_GOTA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Pandas**

A beginner-friendly, hands-on crash course to master the Pandas library for data analysis.

**1. Introduction to Pandas**

Pandas is a Python library built for data manipulation, cleaning, analysis, and transformation. It is used heavily in machine learning, data science, and analytics.

**Why Pandas?**

✔ Load data easily

✔ Clean messy datasets

✔ Analyze patterns

✔ Transform and reshape data

✔ Handle missing values

✔ Merge and join multiple datasets

**Installation** (not needed in Google Colab, where pandas is already pre-installed)



```
!pip install pandas
```



**Importing Pandas**

In [432]:
import pandas as pd

**Pandas Core Data Structures**

**1. Series:** A 1D labeled array (like a column in Excel).

In [433]:
s = pd.Series([10, 20, 30, 40])
print(s)

0    10
1    20
2    30
3    40
dtype: int64


In [434]:
series = pd.Series(['ram', 'sita', 'lakshman', 'hanuman'])
print(series)

0         ram
1        sita
2    lakshman
3     hanuman
dtype: object
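
The numbers 0 to 3 on the left of the outputs above are the default index labels. As a minimal sketch, a Series can also be given custom labels; the labels `a` to `d` below are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical labels for illustration; they do not appear in the notebook.
marks = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

by_label = marks["b"]        # look up a value by its index label
by_position = marks.iloc[1]  # look up a value by its integer position
total = marks.sum()          # aggregate over all values

print(by_label, by_position, total)
```

Both lookups return the same element here because label `b` sits at position 1.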


**2. DataFrame:** A 2D table with rows and columns.

In [435]:
data = {
    "Name": ["Amit", "Rahul", "Priya"],
    "Age": [25, 30, 22]
}

df = pd.DataFrame(data)
print(df)


    Name  Age
0   Amit   25
1  Rahul   30
2  Priya   22


In [436]:
dataf = pd.DataFrame({
    "city": ["Assam", "Delhi", "Pune"],
    "crimerate": [25, 30, 22],
    "population": [255.04, 30.5, 265.62]
})
print(dataf)

    city  crimerate  population
0  Assam         25      255.04
1  Delhi         30       30.50
2   Pune         22      265.62
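
Each column of a DataFrame is itself a Series. The sketch below reuses the `dataf` example from above to show that single brackets return a Series while a list of column names returns a smaller DataFrame.

```python
import pandas as pd

dataf = pd.DataFrame({
    "city": ["Assam", "Delhi", "Pune"],
    "crimerate": [25, 30, 22],
    "population": [255.04, 30.5, 265.62],
})

one_column = dataf["crimerate"]              # a single column is a Series
two_columns = dataf[["city", "population"]]  # a list of names gives a DataFrame

print(type(one_column).__name__, type(two_columns).__name__)
print(dataf.shape)  # (rows, columns)
```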


**Reading & Writing Data**

CSV

In [437]:
df = pd.read_csv("army.csv")  # read_csv() reads a CSV file into a DataFrame
# df.to_csv("output.csv", index=False)  # write the DataFrame back to a CSV file

Excel

In [438]:
# df = pd.read_excel("file.xlsx")

JSON

In [439]:
# df = pd.read_json("file.json")
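
To try reading and writing without any file on disk, the following sketch does a CSV round trip through an in-memory buffer. The tiny `people` frame is made up for the example, and `io.StringIO` stands in for a real file path.

```python
import io

import pandas as pd

# Small made-up frame; it stands in for data loaded from a real file.
people = pd.DataFrame({"name": ["Raju", "Ram"], "age": [32, 25]})

# Write to an in-memory text buffer instead of a file path.
buffer = io.StringIO()
people.to_csv(buffer, index=False)  # index=False leaves out the row labels
csv_text = buffer.getvalue()

# Read the same text back into a new DataFrame.
restored = pd.read_csv(io.StringIO(csv_text))

print(csv_text)
print(restored)
```

Writing without `index=False` would add the row labels as an extra unnamed column, which then reappears as a column when the file is read back.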

**Exploring Data**

In [440]:
df

Unnamed: 0,name,age,salary
0,Raju,32.0,50000.0
1,Puneet,34.0,60000.0
2,Ram,25.0,35000.0
3,Lovnish,,30000.0
4,Ravi,28.0,
5,Puneet,34.0,60000.0


In [441]:
df.head() # first 5 rows

Unnamed: 0,name,age,salary
0,Raju,32.0,50000.0
1,Puneet,34.0,60000.0
2,Ram,25.0,35000.0
3,Lovnish,,30000.0
4,Ravi,28.0,


In [442]:
df.head(2) # first 2 rows

Unnamed: 0,name,age,salary
0,Raju,32.0,50000.0
1,Puneet,34.0,60000.0


In [443]:
df.tail() # last 5 rows

Unnamed: 0,name,age,salary
1,Puneet,34.0,60000.0
2,Ram,25.0,35000.0
3,Lovnish,,30000.0
4,Ravi,28.0,
5,Puneet,34.0,60000.0


In [444]:
print(df.shape) # (rows, columns): the shape of the DataFrame

(6, 3)


In [445]:
df.columns # column names

Index(['name ', 'age', 'salary'], dtype='object')

In [446]:
df.dtypes # shows the data type of each column

Unnamed: 0,0
name,object
age,float64
salary,float64


In [447]:
df.info() # shows data types and non-null counts, which reveals missing values

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   name    6 non-null      object 
 1   age     5 non-null      float64
 2   salary  5 non-null      float64
dtypes: float64(2), object(1)
memory usage: 276.0+ bytes


In [448]:
df.describe() # summary statistics; covers only numeric columns by default

Unnamed: 0,age,salary
count,5.0,5.0
mean,30.6,47000.0
std,3.974921,13964.240044
min,25.0,30000.0
25%,28.0,35000.0
50%,32.0,50000.0
75%,34.0,60000.0
max,34.0,60000.0
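
Since `describe()` skips the text column here, a text column can be summarized on its own. This is a small sketch that rebuilds the displayed data by hand (the real `army.csv` is not available here, so `sample` is an assumption).

```python
import numpy as np
import pandas as pd

# Hand-built copy of the rows displayed above; an assumption, not the real file.
sample = pd.DataFrame({
    "name": ["Raju", "Puneet", "Ram", "Lovnish", "Ravi", "Puneet"],
    "age": [32.0, 34.0, 25.0, np.nan, 28.0, 34.0],
    "salary": [50000.0, 60000.0, 35000.0, 30000.0, np.nan, 60000.0],
})

name_summary = sample["name"].describe()  # count, unique, top, freq
counts = sample["name"].value_counts()    # how often each name occurs

print(name_summary)
print(counts)
```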


# Selecting data

In [449]:
df

Unnamed: 0,name,age,salary
0,Raju,32.0,50000.0
1,Puneet,34.0,60000.0
2,Ram,25.0,35000.0
3,Lovnish,,30000.0
4,Ravi,28.0,
5,Puneet,34.0,60000.0


In [450]:
#Row Slicing
df[1:4] # rows 1 to 3   # end index is exclusive

Unnamed: 0,name,age,salary
1,Puneet,34.0,60000.0
2,Ram,25.0,35000.0
3,Lovnish,,30000.0


In [451]:
df

Unnamed: 0,name,age,salary
0,Raju,32.0,50000.0
1,Puneet,34.0,60000.0
2,Ram,25.0,35000.0
3,Lovnish,,30000.0
4,Ravi,28.0,
5,Puneet,34.0,60000.0


In [452]:
df.iloc[:, :]  # rows, columns
# All the rows and all the columns

Unnamed: 0,name,age,salary
0,Raju,32.0,50000.0
1,Puneet,34.0,60000.0
2,Ram,25.0,35000.0
3,Lovnish,,30000.0
4,Ravi,28.0,
5,Puneet,34.0,60000.0


In [453]:
df.iloc[:, 1] # all rows, only the column at position 1 (the second column)

Unnamed: 0,age
0,32.0
1,34.0
2,25.0
3,
4,28.0
5,34.0


In [454]:
df.iloc[0, :] # 1st row , all the columns

Unnamed: 0,0
name,Raju
age,32.0
salary,50000.0


In [456]:
df.iloc[3]

Unnamed: 0,3
name,Lovnish
age,
salary,30000.0


In [457]:
df.iloc[1:3] # Rows 1 to 2   # end index is exclusive

Unnamed: 0,name,age,salary
1,Puneet,34.0,60000.0
2,Ram,25.0,35000.0


In [458]:
df.columns

Index(['name ', 'age', 'salary'], dtype='object')

In [459]:
df.loc[:, 'age']  # rows, columns

Unnamed: 0,age
0,32.0
1,34.0
2,25.0
3,
4,28.0
5,34.0


In [460]:
df.loc[:, ['age', 'salary']]

Unnamed: 0,age,salary
0,32.0,50000.0
1,34.0,60000.0
2,25.0,35000.0
3,,30000.0
4,28.0,
5,34.0,60000.0
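
The difference between `iloc` (positions) and `loc` (labels) is easier to see when the index is not 0, 1, 2. The sketch below uses names as the index, which is an assumption made for illustration; the notebook keeps the default integer index.

```python
import pandas as pd

# Names used as index labels for illustration only.
people = pd.DataFrame(
    {"age": [32.0, 34.0, 25.0], "salary": [50000.0, 60000.0, 35000.0]},
    index=["Raju", "Puneet", "Ram"],
)

by_label = people.loc["Puneet", "salary"]  # row label, column label
by_position = people.iloc[1, 1]            # row position, column position

label_slice = people.loc["Raju":"Puneet"]  # label slices include the end label
position_slice = people.iloc[0:1]          # position slices exclude the end

print(by_label, by_position)
print(len(label_slice), len(position_slice))
```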


# Sorting data

In [461]:
df.sort_values(by='age', ascending=True)

Unnamed: 0,name,age,salary
2,Ram,25.0,35000.0
4,Ravi,28.0,
0,Raju,32.0,50000.0
1,Puneet,34.0,60000.0
5,Puneet,34.0,60000.0
3,Lovnish,,30000.0
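
In the output above the row with a missing age lands at the end. As a sketch on a hand-built copy of the data (`sample` is an assumption), sorting can also use several columns, mixed directions, and an explicit position for missing values.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the rows displayed above; an assumption, not the real file.
sample = pd.DataFrame({
    "name": ["Raju", "Puneet", "Ram", "Lovnish", "Ravi", "Puneet"],
    "age": [32.0, 34.0, 25.0, np.nan, 28.0, 34.0],
    "salary": [50000.0, 60000.0, 35000.0, 30000.0, np.nan, 60000.0],
})

# Highest salary first; ties broken by age ascending; missing salaries last.
ranked = sample.sort_values(
    by=["salary", "age"],
    ascending=[False, True],
    na_position="last",
)

print(ranked)
```

`sort_values` returns a new DataFrame, so `sample` keeps its original order unless the result is assigned back.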


# Creating a new column

In [462]:
#creating a new column
df['new_column'] = df['age'] + df['salary']

In [463]:
df

Unnamed: 0,name,age,salary,new_column
0,Raju,32.0,50000.0,50032.0
1,Puneet,34.0,60000.0,60034.0
2,Ram,25.0,35000.0,35025.0
3,Lovnish,,30000.0,
4,Ravi,28.0,,
5,Puneet,34.0,60000.0,60034.0


In [464]:
df['is_poor'] = df['salary'] < 40000

In [465]:
df

Unnamed: 0,name,age,salary,new_column,is_poor
0,Raju,32.0,50000.0,50032.0,False
1,Puneet,34.0,60000.0,60034.0,False
2,Ram,25.0,35000.0,35025.0,True
3,Lovnish,,30000.0,,True
4,Ravi,28.0,,,False
5,Puneet,34.0,60000.0,60034.0,False
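
The same kind of boolean condition can filter rows instead of creating a column. A minimal sketch on a hand-built copy of the data (`sample` is an assumption); note that a comparison against a missing value evaluates to `False`, which is why Ravi is not flagged above.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the rows displayed above; an assumption, not the real file.
sample = pd.DataFrame({
    "name": ["Raju", "Puneet", "Ram", "Lovnish", "Ravi", "Puneet"],
    "age": [32.0, 34.0, 25.0, np.nan, 28.0, 34.0],
    "salary": [50000.0, 60000.0, 35000.0, 30000.0, np.nan, 60000.0],
})

# Keep only rows where the condition is True.
low_paid = sample[sample["salary"] < 40000]

# Combine conditions with & (and) or | (or); each condition needs parentheses.
young_and_low = sample[(sample["age"] < 30) & (sample["salary"] < 40000)]

print(low_paid)
print(young_and_low)
```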


In [466]:
df.groupby('age')['salary'].mean()

age,salary
25.0,35000.0
28.0,
32.0,50000.0
34.0,60000.0
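
A group can be summarized with more than one statistic at a time. The sketch below groups a hand-built copy of the data (`sample` is an assumption) by name rather than age and computes the mean and the number of non-missing salaries per group.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the rows displayed above; an assumption, not the real file.
sample = pd.DataFrame({
    "name": ["Raju", "Puneet", "Ram", "Lovnish", "Ravi", "Puneet"],
    "age": [32.0, 34.0, 25.0, np.nan, 28.0, 34.0],
    "salary": [50000.0, 60000.0, 35000.0, 30000.0, np.nan, 60000.0],
})

# One row per name, one column per statistic.
summary = sample.groupby("name")["salary"].agg(["mean", "count"])

print(summary)
```

`count` ignores missing values, so a group whose only salary is missing shows a count of 0 and a mean of NaN.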


In [467]:
df

Unnamed: 0,name,age,salary,new_column,is_poor
0,Raju,32.0,50000.0,50032.0,False
1,Puneet,34.0,60000.0,60034.0,False
2,Ram,25.0,35000.0,35025.0,True
3,Lovnish,,30000.0,,True
4,Ravi,28.0,,,False
5,Puneet,34.0,60000.0,60034.0,False


# Rename a column

In [468]:
df.rename(columns={'is_poor': 'low_salary'}, inplace=True)

In [469]:
df

Unnamed: 0,name,age,salary,new_column,low_salary
0,Raju,32.0,50000.0,50032.0,False
1,Puneet,34.0,60000.0,60034.0,False
2,Ram,25.0,35000.0,35025.0,True
3,Lovnish,,30000.0,,True
4,Ravi,28.0,,,False
5,Puneet,34.0,60000.0,60034.0,False
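
`rename` also accepts a function, which is handy for cleaning headers. The `df.columns` output earlier shows `'name '` with a trailing space; assuming that space is really in the file header, the sketch below strips it. The one-row `raw` frame is made up for the example.

```python
import pandas as pd

# Made-up frame whose first header has a trailing space, as in 'name '.
raw = pd.DataFrame({"name ": ["Raju"], "age": [32.0]})

# Apply str.strip to every column name; returns a new DataFrame.
clean = raw.rename(columns=str.strip)

# A dict renames only the listed columns.
renamed = clean.rename(columns={"age": "age_years"})

print(list(raw.columns), list(clean.columns), list(renamed.columns))
```

Without `inplace=True`, `rename` leaves the original untouched, so the result has to be assigned to a variable.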


# Handling Duplicate Values

In [470]:
df.duplicated().sum()

np.int64(1)

In [471]:
# Dropping duplicate rows
df.drop_duplicates(inplace=True)

In [472]:
df

Unnamed: 0,name,age,salary,new_column,low_salary
0,Raju,32.0,50000.0,50032.0,False
1,Puneet,34.0,60000.0,60034.0,False
2,Ram,25.0,35000.0,35025.0,True
3,Lovnish,,30000.0,,True
4,Ravi,28.0,,,False
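
By default a row counts as a duplicate only when every column matches. As a sketch on a hand-built copy of the original six rows (`sample` is an assumption), duplicates can also be judged on selected columns, and `keep` controls which copy survives.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the six original rows; an assumption, not the real file.
sample = pd.DataFrame({
    "name": ["Raju", "Puneet", "Ram", "Lovnish", "Ravi", "Puneet"],
    "age": [32.0, 34.0, 25.0, np.nan, 28.0, 34.0],
    "salary": [50000.0, 60000.0, 35000.0, 30000.0, np.nan, 60000.0],
})

# keep=False marks every member of a duplicate group, not only the repeats.
all_copies = sample.duplicated(subset=["name"], keep=False)

# Compare on the name column only and keep the last occurrence.
deduped = sample.drop_duplicates(subset=["name"], keep="last")

print(all_copies.sum())
print(deduped)
```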


# Handling missing (null) values

In [473]:
df.isnull().sum()

Unnamed: 0,0
name,0
age,1
salary,1
new_column,2
low_salary,0


In [474]:
df

Unnamed: 0,name,age,salary,new_column,low_salary
0,Raju,32.0,50000.0,50032.0,False
1,Puneet,34.0,60000.0,60034.0,False
2,Ram,25.0,35000.0,35025.0,True
3,Lovnish,,30000.0,,True
4,Ravi,28.0,,,False


In [475]:
# Dropping rows with missing values (returns a new DataFrame; df itself is unchanged)
df.dropna()

Unnamed: 0,name,age,salary,new_column,low_salary
0,Raju,32.0,50000.0,50032.0,False
1,Puneet,34.0,60000.0,60034.0,False
2,Ram,25.0,35000.0,35025.0,True
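
Dropping every row that has any missing value can discard a lot of data. The sketch below, on a hand-built copy of the de-duplicated rows (`sample` is an assumption), shows a few narrower options.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the five de-duplicated rows; an assumption, not the real file.
sample = pd.DataFrame({
    "name": ["Raju", "Puneet", "Ram", "Lovnish", "Ravi"],
    "age": [32.0, 34.0, 25.0, np.nan, 28.0],
    "salary": [50000.0, 60000.0, 35000.0, 30000.0, np.nan],
})

has_age = sample.dropna(subset=["age"])         # only look at the age column
not_empty = sample.dropna(how="all")            # drop rows where everything is missing
full_columns = sample.dropna(axis="columns")    # drop columns that contain any NaN

print(len(has_age), len(not_empty), list(full_columns.columns))
```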


In [476]:
df

Unnamed: 0,name,age,salary,new_column,low_salary
0,Raju,32.0,50000.0,50032.0,False
1,Puneet,34.0,60000.0,60034.0,False
2,Ram,25.0,35000.0,35025.0,True
3,Lovnish,,30000.0,,True
4,Ravi,28.0,,,False


# Filling missing values

In [477]:
# df.fillna(25)

In [478]:
# Assign the result back: df['age'].fillna(..., inplace=True) works on a temporary copy
# and will stop working in pandas 3.0
df['age'] = df['age'].fillna(df['age'].mean())
# To fill salary with its median as well (left commented out, so salary stays missing below):
# df['salary'] = df['salary'].fillna(df['salary'].median())

In [479]:
# filling missing values in the name column with a placeholder
# df['name'] = df['name'].fillna('Unknown')

In [480]:
# filling missing names with the most frequent name in the column
# df['name'] = df['name'].fillna(df['name'].mode()[0])

In [481]:
df

Unnamed: 0,name,age,salary,new_column,low_salary
0,Raju,32.0,50000.0,50032.0,False
1,Puneet,34.0,60000.0,60034.0,False
2,Ram,25.0,35000.0,35025.0,True
3,Lovnish,29.75,30000.0,,True
4,Ravi,28.0,,,False
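
Several columns can be filled in one call by passing a dictionary that maps each column to its fill value. A minimal sketch on a hand-built copy of the data before filling (`sample` is an assumption); unlike the notebook, it also fills salary with the median.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the five rows before filling; an assumption, not the real file.
sample = pd.DataFrame({
    "name": ["Raju", "Puneet", "Ram", "Lovnish", "Ravi"],
    "age": [32.0, 34.0, 25.0, np.nan, 28.0],
    "salary": [50000.0, 60000.0, 35000.0, 30000.0, np.nan],
})

# Map each column to the value that should replace its missing entries.
fill_values = {
    "age": sample["age"].mean(),          # mean of the known ages
    "salary": sample["salary"].median(),  # median of the known salaries
}
filled = sample.fillna(fill_values)       # returns a new DataFrame

print(filled)
```

The mean and median are computed from the known values only, because pandas skips missing values in these statistics by default.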


In [482]:
# df = df.ffill() # forward fill
# df = df.bfill() # backward fill

# Loop through rows

In [483]:
# Loop through rows
for index, row in df.iterrows():
    print(f"{row['age']} has Salary {row['salary']}")


32.0 has Salary 50000.0
34.0 has Salary 60000.0
25.0 has Salary 35000.0
29.75 has Salary 30000.0
28.0 has Salary nan
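
`iterrows()` is fine for small tables but slow on large ones. As a sketch of two common alternatives, `itertuples()` yields lightweight rows with attribute access, and plain column operations avoid the loop entirely. The three-row `sample` and the message format are assumptions made for the example.

```python
import pandas as pd

# Three complete rows from the data above; an assumption, not the real file.
sample = pd.DataFrame({
    "name": ["Raju", "Puneet", "Ram"],
    "salary": [50000.0, 60000.0, 35000.0],
})

# itertuples() gives one named tuple per row; columns become attributes.
lines = []
for row in sample.itertuples(index=False):
    lines.append(f"{row.name} has salary {row.salary}")

# Vectorized version: build all strings with column operations, no loop.
vectorized = sample["name"] + " has salary " + sample["salary"].astype(str)

print(lines)
print(vectorized.tolist())
```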


In [484]:
df

Unnamed: 0,name,age,salary,new_column,low_salary
0,Raju,32.0,50000.0,50032.0,False
1,Puneet,34.0,60000.0,60034.0,False
2,Ram,25.0,35000.0,35025.0,True
3,Lovnish,29.75,30000.0,,True
4,Ravi,28.0,,,False


# Saving (Exporting) Dataset

In [485]:
df.to_csv("armynewww.csv", index=False)