
# 📘 Fun with Pandas - Your First Step into Data Science!

Welcome to **Pandas** – the Swiss Army knife of data manipulation in Python! 🐼  
This interactive guide will teach you how to load, analyze, filter, clean, and summarize data effortlessly.  
Let’s go step by step and get you Pandas-proficient in no time!


### 🔹 1. Getting Started with Pandas

In [None]:
import pandas as pd

# Check the version you're using
print("Pandas version:", pd.__version__)

Pandas version: 2.2.2


> 🧠 **Fun Fact:** The name *pandas* comes from *panel data*, an econometrics term for multidimensional datasets, not from the animal!

### 🔹 2. Create Your First DataFrame

In [None]:
# The dictionary keys become the column names
data = {
    'Name':['Alice','raju','chotu motu','rafi'],
    'Age':[20,21,22,23],
    'City':['Delhi','Agra','Ludhiana',None]
}
df = pd.DataFrame(data)
df

         Name  Age      City
0       Alice   20     Delhi
1        raju   21      Agra
2  chotu motu   22  Ludhiana
3        rafi   23      None


In [1]:
# Try this: build a second DataFrame under its own name,
# so that `df` from the cell above stays available for the later sections
import pandas as pd
shows_data = {
    'Title': ['Stranger Things', 'The Witcher', 'Money Heist', 'Breaking Bad', 'Wednesday'],
    'Genre': ['Sci-Fi', 'Fantasy', 'Thriller', 'Crime', 'Horror'],
    'Episodes Watched': [25, 8, 20, 50, 6],
    'Rating': [8.7, 8.2, 8.3, 9.5, 7.9]
}

shows = pd.DataFrame(shows_data)
shows

### 🔹 3. Reading Data from a CSV File

In [None]:
# Uncomment if you have a CSV
# df = pd.read_csv("your_dataset.csv")
# df.head()  # Show top 5 rows

> 💡 Use `.head()` to preview and `.tail()` to peek at the end.
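
If you don't have a CSV file handy, here is a minimal sketch you can run right away. It assumes some made-up CSV text (`csv_text` is a hypothetical stand-in for a real file) and feeds it to `pd.read_csv` through `io.StringIO`, which behaves like an open file.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = """Name,Age,City
Alice,20,Delhi
raju,21,Agra
chotu motu,22,Ludhiana
rafi,23,
"""

# read_csv accepts a path or any file-like object; StringIO wraps the text.
people = pd.read_csv(io.StringIO(csv_text))

print(people.head(2))   # first two rows
print(people.tail(2))   # last two rows
print(people.shape)     # (number of rows, number of columns)
```

The empty field at the end of the last line is read as a missing value (`NaN`), just like the `None` in the DataFrame built by hand earlier.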

### 🔹 4. Descriptive Statistics

In [None]:
df.describe()
# By default, describe() summarizes only the numeric columns
# 25%, 50%, 75% are the quartiles; the 50% row is the median

             Age
count   4.000000
mean   21.500000
std     1.290994
min    20.000000
25%    20.750000
50%    21.500000
75%    22.250000
max    23.000000
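
The summary above skips `Name` and `City` because they hold text. As a small sketch (the frame is rebuilt here under the assumed name `df_demo` so the cell runs on its own), `include='all'` asks `describe()` to cover every column.

```python
import pandas as pd

# Same sample data as the tutorial, rebuilt so this cell is self-contained.
df_demo = pd.DataFrame({
    'Name': ['Alice', 'raju', 'chotu motu', 'rafi'],
    'Age': [20, 21, 22, 23],
    'City': ['Delhi', 'Agra', 'Ludhiana', None],
})

# Default: numeric columns only.
numeric_summary = df_demo.describe()

# include='all' adds text columns, reported with count, unique, top and freq.
all_summary = df_demo.describe(include='all')

# The 50% row matches the median of the column.
print(numeric_summary.loc['50%', 'Age'], df_demo['Age'].median())
print(all_summary)
```

Notice that `count` for `City` is lower than for the other columns: `describe()` counts only non-missing values.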


### 🔹 5. Selecting Columns and Rows

In [None]:
# Column selection (uncomment to try)
# print(df['Name'])             # Series: a single column
# print(df[['Name']])           # DataFrame with 1 column
# print(df[['Name', 'City']])   # DataFrame with 2 columns

# Row selection by position
print(df.iloc[1])

# Row selection by condition
# print(df['Age'] > 20)         # boolean Series used as the filter
print(df[df['Age'] > 20])

Name    raju
Age       21
City    Agra
Name: 1, dtype: object
         Name  Age      City
1        raju   21      Agra
2  chotu motu   22  Ludhiana
3        rafi   23      None


In [None]:
# Slice by position: start 0, stop 3 (excluded), step 2 -> rows 0 and 2
print(df.iloc[0:3:2])
# print(df.iloc[:, 1])   # all rows, second column

         Name  Age      City
0       Alice   20     Delhi
2  chotu motu   22  Ludhiana


In [None]:
print(df.loc[0])           # Row whose index label is 0
# df.loc[1:3]              # Rows labelled 1 through 3 (the end is inclusive!)
print(df.loc[:, 'Name'])   # All rows, only the 'Name' column
# df.loc[0, 'City']        # Single value at row label 0, column 'City'

# Exercise: show all rows of the 'Age' column (one possible answer below)
df.loc[:, ['Age']]

Name    Alice
Age        20
City    Delhi
Name: 0, dtype: object
0         Alice
1          raju
2    chotu motu
3          rafi
Name: Name, dtype: object
   Age
0   20
1   21
2   22
3   23


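> 🧠 `.iloc` selects by integer position, `.loc` selects by label. With the default 0, 1, 2, ... index they look alike, but slices differ: `.iloc` excludes the end, `.loc` includes it. Try both!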
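
The difference is easier to see when the index labels are not 0, 1, 2. Here is a small sketch with a hypothetical frame called `scores` whose labels are 10, 20 and 30.

```python
import pandas as pd

# Hypothetical frame whose index labels differ from the row positions.
scores = pd.DataFrame(
    {'Name': ['Alice', 'raju', 'rafi'], 'Age': [20, 21, 23]},
    index=[10, 20, 30],
)

first_by_position = scores.iloc[0]   # first row, whatever its label is
row_by_label = scores.loc[10]        # row whose index label is 10
# scores.loc[0] would raise a KeyError here: no row carries the label 0.

by_position = scores.iloc[0:1]       # position 0 only (stop is excluded)
by_label = scores.loc[10:20]         # labels 10 and 20 (stop is included)

print(by_position)
print(by_label)
```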

### 🔹 6. Adding & Modifying Columns

In [None]:
# Add a new column derived from an existing one
df['age after 5 yrs'] = df['Age'] + 5

# Modify an existing column (the assignment overwrites the old values)
df['Age'] = df['Age'] - 5
df

         Name  Age      City  age after 5 yrs
0       Alice   15     Delhi               25
1        raju   16      Agra               26
2  chotu motu   17  Ludhiana               27
3        rafi   18      None               28


In [None]:
# Restore the original ages so the following sections work with the starting data
df['Age'] = df['Age'] + 5

### 🔹 7. Deleting Columns or Rows

In [None]:
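# Remove a column (inplace=True changes df itself)
df.drop('age after 5 yrs', axis=1, inplace=True)

# Remove a row: inplace=False (the default) returns a new DataFrame,
# so df still contains row 2 unless the result is assigned to a name
df.drop(2, axis=0, inplace=False)
df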

         Name  Age      City
0       Alice   20     Delhi
1        raju   21      Agra
2  chotu motu   22  Ludhiana
3        rafi   23      None


> 🚨 By default, `drop` returns a new DataFrame and leaves the original untouched. To keep the change, either assign the result back (`df = df.drop(...)`) or pass `inplace=True`.
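
Here is a small sketch of both ways of keeping a `drop`, using a hypothetical frame called `people`. The `columns=` and `index=` keywords are an alternative spelling of `axis=1` and `axis=0`.

```python
import pandas as pd

people = pd.DataFrame({'Name': ['Alice', 'raju', 'rafi'], 'Age': [20, 21, 23]})

# The result is thrown away here, so people still has both columns.
people.drop(columns='Age')
print(people.columns.tolist())

# Option 1: assign the result to a name.
without_age = people.drop(columns='Age')

# Option 2: change the frame itself. drop() then returns None.
people.drop(index=1, inplace=True)

print(without_age)
print(people)
```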

### 🔹 8. Sorting Data

In [None]:
# df.sort_values(by='Age',ascending=True)
df.sort_values(by='Age',ascending=False)

         Name  Age      City
3        rafi   23      None
2  chotu motu   22  Ludhiana
1        raju   21      Agra
0       Alice   20     Delhi
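
Sorting has a few more knobs worth trying. This sketch, on a rebuilt copy of the sample data named `people`, sorts by two columns, chooses where missing values go, and renumbers the index afterwards.

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['Alice', 'raju', 'chotu motu', 'rafi'],
    'Age': [20, 21, 22, 23],
    'City': ['Delhi', 'Agra', 'Ludhiana', None],
})

# Sort by City (A to Z), then by Age (high to low) within the same city.
# na_position='first' puts rows with a missing City at the top.
by_city = people.sort_values(
    by=['City', 'Age'], ascending=[True, False], na_position='first'
)

# sort_values keeps the old index labels; reset_index(drop=True) renumbers them.
oldest_first = people.sort_values(by='Age', ascending=False).reset_index(drop=True)

print(by_city)
print(oldest_first)
```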


### 🔹 9. Grouping & Aggregation

In [None]:
# Group by City and calculate the average age
# (rows with a missing City are left out of the groups by default)
df.groupby('City')['Age'].mean()

City
Agra        21.0
Delhi       20.0
Ludhiana    22.0
Name: Age, dtype: float64
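
With one person per city, the averages above are not very exciting. Below is a sketch on a slightly larger, made-up frame (the extra row `'meera'` and the repeated cities are assumptions for illustration) that computes several statistics at once and keeps the missing-city group.

```python
import pandas as pd

# Hypothetical data with repeated cities so each group has more than one row.
people = pd.DataFrame({
    'Name': ['Alice', 'raju', 'chotu motu', 'rafi', 'meera'],
    'Age': [20, 21, 22, 23, 24],
    'City': ['Delhi', 'Agra', 'Delhi', None, 'Agra'],
})

# agg() with a list computes several statistics per group in one call.
stats = people.groupby('City')['Age'].agg(['mean', 'min', 'count'])

# dropna=False keeps the rows whose City is missing as their own group.
with_missing = people.groupby('City', dropna=False)['Age'].mean()

print(stats)
print(with_missing)
```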


### 🔹 10. Handling Missing Data (NaN)

In [None]:
# Simulate missing data
df.loc[1, 'Age'] = None
print("With NaN:\n", df)

# Fill with the mean and assign the result back to the column.
# df['Age'].fillna(..., inplace=True) acts on a temporary copy and
# raises a FutureWarning in pandas 2.x, so avoid that form.
df['Age'] = df['Age'].fillna(df['Age'].mean())
df

With NaN:
          Name   Age      City
0       Alice  20.0     Delhi
1        raju   NaN      Agra
2  chotu motu  22.0  Ludhiana
3        rafi  23.0      None


         Name        Age      City
0       Alice  20.000000     Delhi
1        raju  21.666667      Agra
2  chotu motu  22.000000  Ludhiana
3        rafi  23.000000      None
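
Before filling anything, it helps to count what is missing. This sketch rebuilds the sample data as `people` and shows three common moves: counting gaps, filling them per column, and dropping incomplete rows. The `'Unknown'` placeholder is just an illustrative choice.

```python
import numpy as np
import pandas as pd

people = pd.DataFrame({
    'Name': ['Alice', 'raju', 'chotu motu', 'rafi'],
    'Age': [20, np.nan, 22, 23],
    'City': ['Delhi', 'Agra', 'Ludhiana', None],
})

# Count the missing values in each column.
missing_per_column = people.isna().sum()

# A dict fills each column with its own value and returns a new DataFrame.
filled = people.fillna({'Age': people['Age'].mean(), 'City': 'Unknown'})

# dropna() keeps only the rows that have no missing values at all.
complete_rows = people.dropna()

print(missing_per_column)
print(filled)
print(complete_rows)
```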


### 🔹 11. Exporting Data

In [None]:
df.to_csv("cleaned_data.csv", index=False)
print("Data saved as cleaned_data.csv")
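
Curious what `index=False` changes? A quick sketch with a tiny hypothetical frame: when `to_csv` gets no file name it returns the CSV text, so you can compare both versions without touching the disk.

```python
import io
import pandas as pd

people = pd.DataFrame({'Name': ['Alice', 'raju'], 'Age': [20, 21]})

# With no path, to_csv returns the CSV as a string.
csv_without_index = people.to_csv(index=False)
csv_with_index = people.to_csv()   # index is written as an unnamed first column

print(csv_without_index)
print(csv_with_index)

# Reading the text back gives the same columns and values.
restored = pd.read_csv(io.StringIO(csv_without_index))
print(restored)
```

Leaving the index out avoids an extra unnamed column appearing when the file is loaded again.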

### 🔹 12. Bonus Tips & Tricks

In [None]:
# Value counts: how often each city appears
# (wrap in print() to see it when it is not the last line of the cell)
df['City'].value_counts()

# Apply a function to every value in a column
df['Age Group'] = df['Age'].apply(lambda x: 'Senior' if x > 30 else 'Young')

# Rename columns
df.rename(columns={'Age': 'Age (Years)'}, inplace=True)

# Check for duplicates
# duplicated() flags rows that repeat an earlier row; sum() counts them
df.duplicated().sum()

np.int64(0)
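
`apply` with a lambda works, but it runs the function once per value. As an alternative sketch (not the notebook's own method), `np.where` and `pd.cut` apply the same kind of rule to the whole column at once. The frame `ages`, the bin edges and the labels are made up for illustration.

```python
import numpy as np
import pandas as pd

ages = pd.DataFrame({'Age': [20, 35, 30, 62]})

# Same rule as the lambda: 'Senior' if the age is above 30, otherwise 'Young'.
ages['Age Group'] = np.where(ages['Age'] > 30, 'Senior', 'Young')

# pd.cut handles more than two groups. Bins are (0, 30], (30, 60], (60, 120].
ages['Bracket'] = pd.cut(
    ages['Age'], bins=[0, 30, 60, 120], labels=['Young', 'Adult', 'Senior']
)

print(ages)
```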

### 🔹 🎯 Final Challenge (Practice Time!)

Try this:
- Load a dataset of your choice from [Kaggle](https://kaggle.com)
- Clean the data: remove NaNs, drop duplicates
- Calculate groupwise metrics (mean/sum/count)
- Create new columns using custom logic
- Export cleaned data
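
If you want a starting point, the steps above can be sketched on a tiny made-up dataset. Everything here (`raw_csv`, the column names, the `revenue` rule) is a hypothetical stand-in for whatever you download; swap in `pd.read_csv("your_file.csv")` for real data.

```python
import io
import pandas as pd

# Hypothetical stand-in for a downloaded dataset.
raw_csv = """city,product,units,price
Delhi,pen,10,5.0
Delhi,pen,10,5.0
Agra,book,3,120.0
Agra,pen,,5.0
Delhi,book,2,120.0
"""

# 1. Load
sales = pd.read_csv(io.StringIO(raw_csv))

# 2. Clean: drop exact duplicate rows, then rows with missing values
cleaned = sales.drop_duplicates().dropna().copy()

# 3. New column from custom logic
cleaned['revenue'] = cleaned['units'] * cleaned['price']

# 4. Groupwise metrics
per_city = cleaned.groupby('city')['revenue'].agg(['sum', 'mean', 'count'])

# 5. Export: pass a file name to write to disk; with none, the text is returned
export_text = cleaned.to_csv(index=False)

print(per_city)
```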

> 🤯 **Fun Fact Before You Go!**
>
> Pandas is used by **Netflix**, **Spotify**, and **NASA** to analyze data. You're in good company! 🚀