
# 🐼 Pandas Zero-to-Hero Guide

A beginner-friendly, hands-on crash course to master the Pandas library for data analysis.

In [117]:
import pandas as pd

In [118]:
# Load the dataset
# Replace with your actual CSV file
# Use this sample if no dataset available: https://raw.githubusercontent.com/lovnishverma/datasets/refs/heads/main/testdata.csv
df = pd.read_csv('https://raw.githubusercontent.com/lovnishverma/datasets/refs/heads/main/testdata.csv')

| Expression      | Meaning                     |
| --------------- | --------------------------- |
| `df`            | Show full DataFrame         |
| `df.head()`     | First 5 rows                |
| `df.tail()`     | Last 5 rows                 |
| `df[1:4]`       | Rows 1 to 3                 |
| `df.iloc[:, :]` | All rows, all columns       |
| `df.iloc[:, 1]` | All rows, column at index 1 |
| `df.iloc[0, :]` | First row, all columns      |


🙈 1. First Look

In [119]:
print("\n First few rows:")
print(df.head())

print("\n Last few rows:")
print(df.tail())

print("\n Shape of DataFrame (rows, columns):")
print(df.shape)

print("\n Column names:")
print(df.columns)

print("\n Data types:")
print(df.dtypes)

df.info()              # Data types & non-null counts (prints directly; returns None)
print(df.describe())   # Summary statistics for the numeric columns


 First few rows:
   rank discipline  phd  service   sex    salary
0  Prof          B   56     49.0  Male  186960.0
1  Prof          A   12      6.0  Male   93000.0
2   NaN          A   23     20.0  Male  110515.0
3  Prof          A   40     31.0   NaN  131205.0
4  Prof          B   20      NaN  Male  104800.0

 Last few rows:
         rank discipline  phd  service     sex    salary
75       Prof          B   18     10.0  Female  105450.0
76  AssocProf          B   19      6.0  Female  104542.0
77       Prof          B   17     17.0  Female  124312.0
78       Prof          A   28     14.0  Female  109954.0
79       Prof          A   23     15.0  Female  109646.0

 Shape of DataFrame (rows, columns):
(80, 6)

 Column names:
Index(['rank', 'discipline', 'phd', 'service', 'sex', 'salary'], dtype='object')

 Data types:
rank           object
discipline     object
phd             int64
service       float64
sex            object
salary        float64
dtype: object
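A note on `info()`: it writes its report straight to standard output and returns `None`, so it behaves differently from `head()` or `describe()`. The sketch below uses a small made-up frame (the values are illustrative, not taken from the sample dataset) and shows one way to capture the report as text through the `buf` parameter.

```python
import io

import pandas as pd

# Hypothetical three-row frame; only the column names echo the sample dataset.
demo = pd.DataFrame({"phd": [56, 12, 23], "service": [49.0, None, 20.0]})

result = demo.info()   # the report is printed as a side effect
print(result)          # None, because info() has no return value

# To keep the report as a string, hand info() a text buffer.
buffer = io.StringIO()
demo.info(buf=buffer)
report = buffer.getvalue()
```

The captured `report` string can then be logged or written to a file alongside the data.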


✌ 2. Selecting Data

In [120]:
# Position-based selection
print(df[1:4])            # Rows at positions 1 to 3 (end excluded)
print(df.iloc[:, :])      # All rows, all columns
print(df.iloc[:, 1])      # All rows, column at position 1
print(df.iloc[0, :])      # First row, all columns

# Label-based selection with .loc (the end label is included)
print(df.loc[0])          # Row labelled 0
print(df.loc[1:3])        # Rows labelled 1, 2 and 3

print(df.columns)  # Confirm the column names before selecting by name

# Drop a column with the .drop() method:
# df.drop(columns=['rank'], inplace=True)

# To drop several columns, pass a list:
# df.drop(columns=['rank', 'service'], inplace=True)

# Selecting specific columns by label
print(df.loc[:, 'phd'])                    # All rows, column 'phd'
print(df.loc[0:2, ['rank', 'service']])    # Rows 0 to 2, columns 'rank' and 'service'


   rank discipline  phd  service   sex    salary
1  Prof          A   12      6.0  Male   93000.0
2   NaN          A   23     20.0  Male  110515.0
3  Prof          A   40     31.0   NaN  131205.0
         rank discipline  phd  service     sex    salary
0        Prof          B   56     49.0    Male  186960.0
1        Prof          A   12      6.0    Male   93000.0
2         NaN          A   23     20.0    Male  110515.0
3        Prof          A   40     31.0     NaN  131205.0
4        Prof          B   20      NaN    Male  104800.0
..        ...        ...  ...      ...     ...       ...
75       Prof          B   18     10.0  Female  105450.0
76  AssocProf          B   19      6.0  Female  104542.0
77       Prof          B   17     17.0  Female  124312.0
78       Prof          A   28     14.0  Female  109954.0
79       Prof          A   23     15.0  Female  109646.0

[80 rows x 6 columns]
0     B
1     A
2     A
3     A
4     B
     ..
75    B
76    B
77    B
78    A
79    A
Name: dis
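With the default 0, 1, 2, ... index, `.loc` and `.iloc` can look interchangeable. The difference shows up once labels and positions diverge. The following sketch uses a made-up frame with an index starting at 10 (an assumption chosen purely for illustration).

```python
import pandas as pd

# Hypothetical frame whose index labels deliberately differ from positions.
demo = pd.DataFrame(
    {"rank": ["Prof", "Prof", "AsstProf", "Prof"], "phd": [56, 12, 3, 40]},
    index=[10, 11, 12, 13],
)

by_label = demo.loc[10:12]     # labels 10, 11 and 12: the end label is included
by_position = demo.iloc[0:2]   # positions 0 and 1: the end position is excluded
cell = demo.loc[11, "phd"]     # a single value by row label and column name

print(by_label)
print(by_position)
print(cell)
```

A label slice returns three rows here while the positional slice returns two, which is a common source of off-by-one surprises.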

🎯 3. Filtering Rows

In [121]:
# Selecting & Filtering Data (Selecting Columns)

# print(df['rank'])  # Selecting a single column
# print(df[['phd', 'service']])  # Selecting multiple columns

# Example: keep only rows where 'service' is greater than 45 (rows with a missing 'service' are excluded)
print(df[df['service'] > 45])


    rank discipline  phd  service   sex    salary
0   Prof          B   56     49.0  Male  186960.0
10  Prof          A   51     51.0  Male   57800.0
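Filters can also be combined. The sketch below, on a small made-up frame, joins two conditions with `&` and matches several values with `isin()`. Each condition needs its own parentheses because `&` binds more tightly than the comparison operators.

```python
import pandas as pd

# Hypothetical data; the column names mirror the sample dataset.
demo = pd.DataFrame({
    "rank": ["Prof", "Prof", "AsstProf", "AssocProf"],
    "service": [49.0, None, 2.0, 20.0],
    "salary": [186960.0, 93000.0, 88000.0, 104800.0],
})

# A comparison against NaN is False, so the row with a missing 'service' is dropped.
long_service = demo[demo["service"] > 10]

# Both conditions must hold (use | for "either").
well_paid_profs = demo[(demo["rank"] == "Prof") & (demo["salary"] > 100000)]

# Match any value from a list.
junior = demo[demo["rank"].isin(["AsstProf", "AssocProf"])]

print(long_service)
print(well_paid_profs)
print(junior)
```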


📊 4. Sorting Data

In [122]:
print(df.sort_values(by='service', ascending=True))
print(df.sort_values(by='service', ascending=False))

        rank discipline  phd  service     sex    salary
13  AsstProf          B    1      0.0    Male   88000.0
14  AsstProf          B    1      0.0    Male   88000.0
25  AsstProf          A    2      0.0    Male   85000.0
19  AsstProf          B    4      0.0    Male   92000.0
54      Prof          A   12      0.0  Female  105000.0
..       ...        ...  ...      ...     ...       ...
29      Prof          A   45     43.0    Male  155865.0
38      Prof          B   45     45.0    Male  146856.0
0       Prof          B   56     49.0    Male  186960.0
10      Prof          A   51     51.0    Male   57800.0
4       Prof          B   20      NaN    Male  104800.0

[80 rows x 6 columns]
        rank discipline  phd  service     sex    salary
10      Prof          A   51     51.0    Male   57800.0
0       Prof          B   56     49.0    Male  186960.0
38      Prof          B   45     45.0    Male  146856.0
29      Prof          A   45     43.0    Male  155865.0
42      Prof          A  
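In the ascending output above, the row with a missing `service` value lands at the end, which is the default placement for NaN. A short sketch on made-up data shows how `na_position` changes that, and how to sort by more than one column.

```python
import pandas as pd

# Hypothetical data; the column names mirror the sample dataset.
demo = pd.DataFrame({
    "rank": ["Prof", "AsstProf", "Prof", "AsstProf"],
    "service": [20.0, None, 49.0, 2.0],
})

# Put rows with a missing sort key first instead of last.
nan_first = demo.sort_values(by="service", na_position="first")

# Sort by rank (A to Z), then by service (highest first) within each rank.
by_rank_then_service = demo.sort_values(
    by=["rank", "service"], ascending=[True, False]
)

print(nan_first)
print(by_rank_then_service)
```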

🆕 5. Creating New Columns

In [123]:
df['is_senior'] = df['rank'] == 'Prof'  # Example boolean logic
print(df[['rank', 'is_senior']])

         rank  is_senior
0        Prof       True
1        Prof       True
2         NaN      False
3        Prof       True
4        Prof       True
..        ...        ...
75       Prof       True
76  AssocProf      False
77       Prof       True
78       Prof       True
79       Prof       True

[80 rows x 2 columns]
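Row 2 in the output above has a missing `rank`, yet `is_senior` is `False` there, because comparing NaN with a string yields `False`. If "unknown" should stay distinct from "not senior", one option (a sketch, not the only approach) is to mask the result wherever the source value is missing.

```python
import pandas as pd

# Hypothetical three-value column with one missing entry.
rank = pd.Series(["Prof", None, "AssocProf"], name="rank")

plain = rank == "Prof"                    # the missing value compares as False
keeps_missing = plain.mask(rank.isna())   # the missing rank stays missing

print(plain.tolist())
print(keeps_missing.tolist())
```

The masked column is no longer pure boolean, so the choice depends on whether later steps need to tell "missing" apart from "not a professor".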


🧮 6. Aggregation / Grouping

In [124]:
print(df.groupby('service')['phd'].mean())

service
0.0      3.857143
1.0      3.000000
2.0      4.750000
3.0      8.000000
4.0      4.000000
5.0     10.000000
6.0     12.666667
7.0     17.500000
8.0     13.750000
9.0     12.000000
10.0    14.333333
11.0    14.000000
14.0    24.000000
15.0    23.500000
17.0    18.000000
18.0    17.800000
19.0    29.666667
20.0    21.333333
21.0    22.000000
22.0    25.000000
23.0    27.000000
24.0    26.000000
25.0    25.000000
26.0    36.000000
27.0    29.000000
30.0    33.000000
31.0    37.500000
33.0    37.000000
36.0    39.000000
43.0    45.000000
45.0    45.000000
49.0    56.000000
51.0    51.000000
Name: phd, dtype: float64
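Grouping by a numeric column such as `service` produces one group per distinct value. Grouping by a categorical column is often easier to read. The sketch below, on made-up data, groups by `rank`, computes two statistics at once with `agg()`, and shows that rows with a missing key are left out unless `dropna=False` is passed (a parameter available from pandas 1.1).

```python
import pandas as pd

# Hypothetical data; the column names mirror the sample dataset.
demo = pd.DataFrame({
    "rank": ["Prof", "Prof", "AsstProf", None],
    "salary": [186960.0, 93000.0, 88000.0, 110515.0],
})

# Several statistics per group in one call.
summary = demo.groupby("rank")["salary"].agg(["mean", "count"])

# Keep the rows whose group key is missing as their own group.
with_missing = demo.groupby("rank", dropna=False)["salary"].mean()

print(summary)
print(with_missing)
```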


🛠️ 7. Renaming Columns (if needed)

In [125]:
# df.rename(columns={'rank': 'Rank', 'service': 'Service'}, inplace=True)
print(df.columns)

Index(['rank', 'discipline', 'phd', 'service', 'sex', 'salary', 'is_senior'], dtype='object')
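As an alternative to `inplace=True`, `rename()` can return a new DataFrame and leave the original untouched, which makes it easy to compare before and after. A minimal sketch on a made-up one-row frame:

```python
import pandas as pd

# Hypothetical frame with two of the sample dataset's column names.
demo = pd.DataFrame({"rank": ["Prof"], "service": [49.0]})

# rename() returns a copy with the new names; 'demo' itself is unchanged.
renamed = demo.rename(columns={"rank": "Rank", "service": "Service"})

print(list(demo.columns))
print(list(renamed.columns))
```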


🧽 8. Handling Duplicates

In [126]:
print("\n Checking for duplicate rows:")
print(df.duplicated().sum())

# Drop duplicates if any
df.drop_duplicates(inplace=True)
print("\n Shape after dropping duplicates:")
print(df.shape)



 Checking for duplicate rows:
2

 Shape after dropping duplicates:
(78, 7)
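Before dropping duplicates it can help to look at them. The sketch below, on made-up data, lists every copy with `keep=False` and then shows how `subset` restricts the comparison to selected columns.

```python
import pandas as pd

# Hypothetical data: rows 0 and 1 are identical; row 3 differs only in salary.
demo = pd.DataFrame({
    "rank": ["AsstProf", "AsstProf", "Prof", "AsstProf"],
    "phd": [1, 1, 12, 1],
    "salary": [88000.0, 88000.0, 93000.0, 90000.0],
})

all_copies = demo[demo.duplicated(keep=False)]          # every row that has a twin
deduped = demo.drop_duplicates()                        # full-row comparison
by_key = demo.drop_duplicates(subset=["rank", "phd"])   # compare two columns only

print(all_copies)
print(deduped)
print(by_key)
```

Without `inplace=True`, `drop_duplicates()` returns a new frame, so the original stays available for inspection.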


❓ 9. Missing Values Handling

In [127]:
print("\n Missing values per column:")
print(df.isnull().sum())

# Fill a numeric column with a summary statistic.
# The three lines below are alternatives: once one has run, no NaN remains
# for the others to fill. ('phd' has no missing values here, so they are no-ops.)
df['phd'] = df['phd'].fillna(df['phd'].mean())     # Fill with mean (average)
df['phd'] = df['phd'].fillna(df['phd'].median())   # Fill with median
df['phd'] = df['phd'].fillna(df['phd'].mode()[0])  # Fill with mode (most frequent)

# Forward / backward fill (useful in time series or ordered data)
df = df.ffill()  # Forward fill: copy the value from the previous row
df = df.bfill()  # Backward fill: copy the value from the next row (covers a NaN in the first row)

# Drop rows/columns with NaN (uncomment if needed)
# df.dropna(inplace=True)
# df.dropna(axis=1, inplace=True)

# Final missing value report
print("\n Missing values after cleaning:")
print(df.isnull().sum())


 Missing values per column:
rank          1
discipline    1
phd           0
service       1
sex           1
salary        1
is_senior     0
dtype: int64

 Missing values after cleaning:
rank          0
discipline    0
phd           0
service       0
sex           0
salary        0
is_senior     0
dtype: int64
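Forward and backward fill copy a neighbouring row's value, which is reasonable for ordered data but arbitrary for unordered records. One alternative (a sketch on made-up data, not a recommendation for every dataset) is to choose a fill value per column type: the median for a numeric column and the most frequent value for a categorical one.

```python
import pandas as pd

# Hypothetical data with one gap in each column.
demo = pd.DataFrame({
    "service": [49.0, None, 20.0, 6.0],
    "sex": ["Male", "Male", None, "Female"],
})

cleaned = demo.copy()  # work on a copy so the raw data stays available

# Numeric column: the median is less sensitive to outliers than the mean.
cleaned["service"] = cleaned["service"].fillna(cleaned["service"].median())

# Categorical column: mode() returns a Series, so take its first entry.
cleaned["sex"] = cleaned["sex"].fillna(cleaned["sex"].mode()[0])

print(cleaned)
print(cleaned.isnull().sum())
```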


🔁 10. Looping Through Rows (not recommended for big data)

In [128]:
# Loop through rows (slow on large DataFrames; prefer vectorized operations)

# for index, row in df.iterrows():
#     print(f"Row {index}: rank {row['rank']}, service {row['service']}")
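Row loops are discouraged on large data because each row is handled in Python one at a time. The sketch below builds the same label two ways on a made-up frame: with `itertuples()`, which is generally faster than `iterrows()`, and with column-wise string operations that avoid the loop altogether.

```python
import pandas as pd

# Hypothetical data; the column names mirror the sample dataset.
demo = pd.DataFrame({"rank": ["Prof", "AsstProf"], "service": [49.0, 2.0]})

# Loop version: each row arrives as a namedtuple with one attribute per column.
looped = [
    f"{row.rank}: {int(row.service)} years"
    for row in demo.itertuples(index=False)
]

# Vectorized version: operate on whole columns at once.
labels = demo["rank"] + ": " + demo["service"].astype(int).astype(str) + " years"

print(looped)
print(labels.tolist())
```

Note that `astype(int)` raises an error if the column still contains NaN, so this form assumes missing values were handled first.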

🔚 11. Export Cleaned Data

In [129]:
df.to_csv('cleaned_testdata.csv', index=False)
print("\n Cleaned data saved to 'cleaned_testdata.csv'")


 Cleaned data saved to 'cleaned_testdata.csv'
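A quick way to check an export is to read the file back and compare it with what was written. This sketch uses a made-up frame and a temporary directory (both are assumptions for illustration, so nothing is left on disk).

```python
import os
import tempfile

import pandas as pd

# Hypothetical data; the column names mirror the sample dataset.
demo = pd.DataFrame({"rank": ["Prof", "AsstProf"], "salary": [186960.0, 88000.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "cleaned_demo.csv")
    demo.to_csv(path, index=False)   # index=False keeps the row index out of the file
    restored = pd.read_csv(path)

print(restored)
```

With `index=False` the file contains only the data columns, so reading it back does not produce an extra unnamed index column.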
