## 1. Introduction to DataFrame

- A DataFrame is a 2D tabular structure (rows × columns).

- Think of it as:

    - An Excel sheet in Python.

    - A SQL table in memory.

    - The dataset you’ll feed to ML models.

- Why is it important in ML?
    - Most datasets (CSV, JSON, SQL dumps) are loaded into pandas DataFrames before cleaning, analysis, and training.

In [None]:
import pandas as pd
import numpy as np

## 2. Creating DataFrames

- From a dictionary of lists/arrays.

- From list of dicts.

- From NumPy arrays.

- From Series.

In [None]:
# From dict of lists
df1 = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Age": [21, 22, 23],
    "Score": [85, 90, 95]
})

# From list of dicts
df2 = pd.DataFrame([
    {"Name": "A", "Age": 21, "Score": 85},
    {"Name": "B", "Age": 22, "Score": 90}
])

# From NumPy array
arr = np.arange(9).reshape(3,3)
df3 = pd.DataFrame(arr, columns=["A","B","C"])
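
The list above also mentions building a DataFrame from Series, which the cell does not demonstrate. A minimal sketch of that case follows; the Series names and values (`ages`, `scores`) are made up for illustration.

```python
import pandas as pd

# Hypothetical Series; values chosen only for illustration
ages = pd.Series([21, 22, 23], index=["A", "B", "C"], name="Age")
scores = pd.Series([85, 90, 95], index=["A", "B", "C"], name="Score")

# A dict of Series aligns on the shared index; dict keys become column names
df_from_series = pd.DataFrame({"Age": ages, "Score": scores})

# A single Series can be promoted with to_frame(); its name becomes the column name
df_single = ages.to_frame()
```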


## 3. Inspecting a DataFrame

In [None]:
df1.head()       # first 5 rows
df1.tail(3)      # last 3 rows
df1.shape        # (rows, cols)
df1.columns      # column names
df1.info()       # dtypes, non-null counts, memory usage
df1.describe()   # summary stats

# 🔑 ML relevance: info() helps spot categorical vs numerical, missing data, etc.
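
As a rough illustration of that kind of inspection, the sketch below separates numeric from non-numeric columns and counts missing values per column. The frame `df_demo` and its values are assumptions for the example, not data from this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value
df_demo = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Age": [21, np.nan, 23],
    "Score": [85, 90, 95],
})

dtypes = df_demo.dtypes                  # dtype of each column
missing = df_demo.isna().sum()           # number of missing values per column
numeric_cols = df_demo.select_dtypes(include="number").columns.tolist()
other_cols = df_demo.select_dtypes(exclude="number").columns.tolist()
```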

## 4. Selecting Data

- Column selection: df["col"], df[["col1","col2"]]

- Row selection: df.loc[rows, cols] (by label), df.iloc[rows, cols] (by integer position)

In [None]:
df1["Name"]            # Single column
df1[["Name", "Age"]]   # Multiple columns

df1.loc[0]             # Row whose index label is 0
df1.iloc[0]            # First row (by integer position)
df1.loc[0, "Age"]      # Specific cell
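
With the default 0, 1, 2, ... index, `loc` and `iloc` happen to return the same rows, which hides the difference between them. The sketch below uses a hypothetical frame with non-default index labels to make the distinction visible.

```python
import pandas as pd

# Hypothetical frame whose index labels differ from row positions
df_demo = pd.DataFrame(
    {"Age": [21, 22, 23], "Score": [85, 90, 95]},
    index=[10, 20, 30],
)

by_label = df_demo.loc[10, "Age"]     # label 10 -> first row
by_position = df_demo.iloc[0, 0]      # position (0, 0) -> same cell

# Slices: loc includes the end label, iloc excludes the end position
loc_slice = df_demo.loc[10:20]        # rows labelled 10 and 20
iloc_slice = df_demo.iloc[0:1]        # only the row at position 0
```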


## 5. Filtering & Boolean Indexing

In [None]:
df1[df1["Age"] > 21]              # filter
df1[(df1["Age"] > 21) & (df1["Score"] > 85)]  # multiple conditions
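
Beyond `&`, a few other filtering forms come up often. This is a small sketch on a hypothetical copy of the data; note that each condition needs its own parentheses because `&`, `|`, and `~` bind more tightly than comparisons.

```python
import pandas as pd

# Hypothetical data mirroring the columns used above
df_demo = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Age": [21, 22, 23],
    "Score": [85, 90, 95],
})

either = df_demo[(df_demo["Age"] < 22) | (df_demo["Score"] > 90)]  # OR
not_a = df_demo[~(df_demo["Name"] == "A")]                         # NOT
in_set = df_demo[df_demo["Name"].isin(["A", "C"])]                 # membership test
via_query = df_demo.query("Age > 21 and Score > 85")               # string expression
```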


## 6. Adding / Modifying Columns

In [None]:
df1["Pass"] = df1["Score"] > 80
df1["DoubleAge"] = df1["Age"] * 2
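
Direct assignment modifies the DataFrame in place. When a new frame is preferable, `assign()` is one option, and `np.where` covers simple conditional columns. The column names and the cut-off of 90 below are illustrative assumptions.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({"Age": [21, 22, 23], "Score": [85, 90, 95]})

# assign() returns a new DataFrame and leaves df_demo untouched
df_new = df_demo.assign(
    ScorePerYear=lambda d: d["Score"] / d["Age"],
    Grade=lambda d: np.where(d["Score"] >= 90, "high", "low"),  # hypothetical cut-off
)
```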


## 7. Dropping Rows/Columns

In [None]:
df1.drop("DoubleAge", axis=1, inplace=True)  # drop column
df1.drop(1, axis=0, inplace=True)            # drop row
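
The same drops can be written without `inplace=True`, using the `columns=` and `index=` keywords instead of `axis`. A sketch on hypothetical data:

```python
import pandas as pd

df_demo = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Age": [21, 22, 23],
    "DoubleAge": [42, 44, 46],
})

# columns= / index= name the axis explicitly; both calls return new frames
without_col = df_demo.drop(columns=["DoubleAge"])
without_row = df_demo.drop(index=[1])

# errors="ignore" skips labels that are not present instead of raising KeyError
safe = df_demo.drop(columns=["NotThere"], errors="ignore")
```

Returning a new frame keeps the original available, which helps when a notebook cell is re-run.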


## 8. Handling Missing Data (Critical for ML 🚨)

In [None]:
df1.isna().sum()                # count missing
df1.fillna(df1.mean(numeric_only=True), inplace=True)  # fill numeric columns with their mean
df1.dropna(inplace=True)        # drop rows with NaN

# 🔑 ML relevance: Many ML estimators raise errors on NaN input, so impute or drop missing values here.
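
A single mean fill does not suit every column. The sketch below shows one possible per-column strategy (median for a numeric column, most frequent value for a text column) on made-up data; which strategy is appropriate depends on the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a gap in each column
df_demo = pd.DataFrame({
    "Name": ["A", None, "C", "A"],
    "Age": [21, np.nan, 23, 25],
})

filled = df_demo.copy()
# Numeric column: fill with the median of the observed values
filled["Age"] = filled["Age"].fillna(filled["Age"].median())
# Text column: fill with the most frequent value
filled["Name"] = filled["Name"].fillna(filled["Name"].mode()[0])
```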

## 9. Aggregations & GroupBy (Split → Apply → Combine)

In [23]:
df1.groupby("Name")["Score"].mean()
df1.groupby("Name").agg({"Score": ["mean","max"], "Age": "min"})

# 🔑 Used for summarizing datasets before training.

     Score      Age
      mean max  min
Name
A     85.0  85   21
C     95.0  95   23
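
The dict form of `agg` produces two-level column headers. Named aggregation is an alternative that yields flat column names. The `Team` column and the output names below are assumptions for the sketch.

```python
import pandas as pd

# Hypothetical data with a grouping column that has repeated values
df_demo = pd.DataFrame({
    "Team": ["x", "x", "y"],
    "Score": [85, 90, 95],
    "Age": [21, 22, 23],
})

# Each keyword is an output column: name=(input column, aggregation)
summary = df_demo.groupby("Team").agg(
    score_mean=("Score", "mean"),
    score_max=("Score", "max"),
    age_min=("Age", "min"),
)
```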


## 10. Sorting & Ranking

In [None]:
df1.sort_values(by="Age")                     # ascending (default)
df1.sort_values(by="Score", ascending=False)  # descending
df1.sort_index()                              # sort by index labels
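
The heading mentions ranking, which the cell above does not cover. A minimal sketch with hypothetical scores, including a tie:

```python
import pandas as pd

df_demo = pd.DataFrame({"Name": ["A", "B", "C", "D"], "Score": [85, 95, 90, 95]})

# Highest score gets rank 1; method="min" gives tied rows the lowest rank of the tie
df_demo["Rank"] = df_demo["Score"].rank(ascending=False, method="min")

# Sort by several keys: Score descending, then Name ascending to break ties
ordered = df_demo.sort_values(by=["Score", "Name"], ascending=[False, True])
```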


## 11. Merge, Join, Concat (SQL-style)

In [25]:
df_a = pd.DataFrame({"ID": [1,2], "Name": ["A","B"]})
df_b = pd.DataFrame({"ID": [1,2], "Score": [85,90]})

pd.merge(df_a, df_b, on="ID", how="inner")   # SQL-style joins
pd.concat([df_a, df_b], axis=0)              # stack rows; columns missing from one frame become NaN


   ID Name  Score
0   1    A    NaN
1   2    B    NaN
0   1  NaN   85.0
1   2  NaN   90.0
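
An inner join hides what happens when keys do not match. The sketch below uses hypothetical frames with partly overlapping IDs to show a left join, an outer join, and `concat` with a fresh index.

```python
import pandas as pd

# Hypothetical frames: ID 3 exists only on the left, ID 4 only on the right
df_left = pd.DataFrame({"ID": [1, 2, 3], "Name": ["A", "B", "C"]})
df_right = pd.DataFrame({"ID": [1, 2, 4], "Score": [85, 90, 70]})

# Left join keeps every row of df_left; indicator=True adds a _merge column
left = pd.merge(df_left, df_right, on="ID", how="left", indicator=True)

# Outer join keeps keys from both sides
outer = pd.merge(df_left, df_right, on="ID", how="outer")

# ignore_index=True renumbers rows instead of repeating the original labels
stacked = pd.concat([df_left, df_left], axis=0, ignore_index=True)
```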


## 11. Apply, Map, Applymap (Custom Ops)

In [26]:
# Apply function to column
df1["Age"].apply(lambda x: x+1)

# Apply element-wise to Series
df1["Name"].map(str.upper)

# Apply element-wise to entire DataFrame
# (applymap is deprecated since pandas 2.1 in favor of DataFrame.map)
df1.applymap(lambda x: str(x).upper())



  Name Age Score  Pass
0    A  21    85  TRUE
2    C  23    95  TRUE
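
`apply` can also run row by row with `axis=1`, although a vectorized expression is usually faster when one exists. The sketch below compares the two on hypothetical data and shows `map` with a dict; the labels are made up.

```python
import pandas as pd

df_demo = pd.DataFrame({"Age": [21, 22, 23], "Score": [85, 90, 95]})

# Row-wise apply: the function receives one row (a Series) at a time
ratio_apply = df_demo.apply(lambda row: row["Score"] / row["Age"], axis=1)

# Vectorized equivalent: avoids a Python-level loop over rows
ratio_vec = df_demo["Score"] / df_demo["Age"]

# map() with a dict replaces values; values missing from the dict become NaN
labels = df_demo["Age"].map({21: "young", 23: "old"})  # hypothetical labels
```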


## 12. String Operations

In [27]:
df1["Name"].str.lower()
df1["Name"].str.contains("a")   # case-sensitive: "A" does not match "a"
# 🔑 Useful for cleaning messy text data.

0    False
2    False
Name: Name, dtype: bool
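
For messier text, a case-insensitive match and some whitespace cleanup are common first steps. A sketch on a hypothetical Series that includes a missing value:

```python
import pandas as pd

# Hypothetical messy names: stray spaces, mixed case, one missing entry
names = pd.Series(["  alice ", "BOB", None])

cleaned = names.str.strip().str.title()                 # trim spaces, normalize case
has_a = names.str.contains("a", case=False, na=False)   # ignore case; missing -> False
```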

## 13. Datetime Operations

In [None]:
df_dates = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-01", "2021-06-01"])
})
df_dates["year"] = df_dates["date"].dt.year

# 🔑 Essential for time-series ML.
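
Other `.dt` attributes follow the same pattern as `.dt.year`. The sketch below derives a few more features from the same two dates; the new column names are illustrative.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-01", "2021-06-01"])
})

df_demo["month"] = df_demo["date"].dt.month           # calendar month, 1 to 12
df_demo["dayofweek"] = df_demo["date"].dt.dayofweek   # Monday=0 ... Sunday=6
# Subtracting datetimes gives timedeltas; .dt.days extracts whole days
df_demo["days_since_first"] = (df_demo["date"] - df_demo["date"].min()).dt.days
```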

## 14. Handling Categorical & Text Data

In [None]:
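# Hypothetical sample data (df is not defined earlier in this notebook)
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Department": ["HR", "IT", "IT"],
    "Salary": [50000, 60000, 55000]
})

df["Department"].unique()            # unique categories
df["Department"].value_counts()      # frequency of each category
df["Department"].astype("category")  # returns a converted copy; assign it back to keep it

# String ops
df["Name"].str.upper()
df["Name"].str.contains("li")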
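
Most ML libraries expect numeric input, so categorical columns are usually encoded before training. Two common options are sketched here on a hypothetical `Department` Series; which encoding fits depends on the model.

```python
import pandas as pd

dept = pd.Series(["HR", "IT", "IT", "Sales"], name="Department")  # hypothetical values

cat = dept.astype("category")
codes = cat.cat.codes                          # integer code per category (sorted order)
one_hot = pd.get_dummies(dept, prefix="Dept")  # one indicator column per category
```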


## 15. Indexing Tricks

In [None]:
df.set_index("Name", inplace=True)
df.reset_index(inplace=True)
df.rename(columns={"Salary":"Income"}, inplace=True)
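
The same three operations can be written without `inplace=True`, with each step returning a new frame. A sketch on hypothetical data, also showing the label-based lookup that a meaningful index enables:

```python
import pandas as pd

df_demo = pd.DataFrame({"Name": ["A", "B"], "Salary": [100, 200]})  # hypothetical values

indexed = df_demo.set_index("Name")      # returns a new frame; df_demo is unchanged
b_salary = indexed.loc["B", "Salary"]    # label-based lookup by name
restored = indexed.reset_index()         # "Name" becomes a regular column again
renamed = restored.rename(columns={"Salary": "Income"})
```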


## 16. Importing & Exporting Data

In [None]:
# CSV
pd.read_csv("data.csv")
df.to_csv("output.csv", index=False)

# Excel
pd.read_excel("data.xlsx")
df.to_excel("output.xlsx", index=False)

# JSON
pd.read_json("data.json")
df.to_json("output.json", orient="records")
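
To try a round trip without touching the file system, the CSV functions also work with in-memory text. A minimal sketch with hypothetical data:

```python
import io

import pandas as pd

df_demo = pd.DataFrame({"Name": ["A", "B"], "Score": [85, 90]})

# to_csv() with no path returns the CSV text instead of writing a file
csv_text = df_demo.to_csv(index=False)

# read_csv() accepts any file-like object, here an in-memory buffer
df_back = pd.read_csv(io.StringIO(csv_text))
```

Passing `index=False` matters here: without it the index is written as an extra unnamed column and comes back as `Unnamed: 0` on re-read.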
