## 1. Introduction to DataFrames

- A DataFrame is a 2D tabular structure (rows × columns).

- Think of it as:

    - Excel sheet in Python.

    - SQL table in memory.

    - Dataset we’ll feed to ML models.

- Why is it important in ML?
    - Most datasets (CSV, JSON, SQL dumps) are read into Pandas DataFrames before cleaning, analyzing, and training.

In [57]:
import pandas as pd
import numpy as np

## 2. Creating DataFrames

- From a dictionary of lists/arrays.

- From a list of dicts.

- From NumPy arrays.

- From Series.

In [58]:
# From dict of lists
df1 = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Age": [21, 22, 23],
    "Score": [85, 90, 95]
})

# From list of dicts
df2 = pd.DataFrame([
    {"Name": "A", "Age": 21, "Score": 85},
    {"Name": "B", "Age": 22, "Score": 90}
])

# From NumPy array
arr = np.arange(9).reshape(3,3)
df3 = pd.DataFrame(arr, columns=["A","B","C"])
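
The list above also mentions Series, which the cell does not cover. A minimal sketch, assuming two small Series with made-up values (`ages`, `scores`, `df4`, and `df5` are illustrative names, not part of the original notebook):

```python
import pandas as pd

# Two Series sharing the same index labels (values are made up for illustration)
ages = pd.Series([21, 22, 23], index=["A", "B", "C"], name="Age")
scores = pd.Series([85, 90, 95], index=["A", "B", "C"], name="Score")

# A dict of Series is aligned on the index; each key becomes a column
df4 = pd.DataFrame({"Age": ages, "Score": scores})

# A single Series can also be promoted to a one-column DataFrame
df5 = ages.to_frame()
```

Because the Series are aligned by label rather than by position, a label missing from one of them would show up as NaN in that column.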


## 3. Inspecting a DataFrame

Now that we can create DataFrames, the next skill is looking inside them quickly. In ML, when we load a dataset (say, 10k rows), we won’t scroll through everything. Instead, we peek, summarize, and sanity-check.

In [59]:
df1.head()       # first 5 rows
df1.tail(3)      # last 3 rows
df1.shape        # (rows, cols)
df1.columns      # column names
df1.info()       # data types + memory usage
df1.describe()   # summary stats

# 🔑 ML relevance: info() helps spot categorical vs numerical, missing data, etc.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    3 non-null      object
 1   Age     3 non-null      int64 
 2   Score   3 non-null      int64 
dtypes: int64(2), object(1)
memory usage: 204.0+ bytes


        Age  Score
count   3.0    3.0
mean   22.0   90.0
std     1.0    5.0
min    21.0   85.0
25%    21.5   87.5
50%    22.0   90.0
75%    22.5   92.5
max    23.0   95.0
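
The point about spotting categorical vs numerical columns can also be checked programmatically. A rough sketch, assuming the same toy data as `df1` (the names `numeric_cols`, `other_cols`, and `missing_per_col` are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Age": [21, 22, 23],
    "Score": [85, 90, 95],
})

# Columns pandas treats as numeric vs. everything else
numeric_cols = df.select_dtypes(include="number").columns.tolist()
other_cols = df.select_dtypes(exclude="number").columns.tolist()

# Missing values per column (all zeros for this toy frame)
missing_per_col = df.isna().sum()
```

On a real dataset, a column we expect to be numeric landing in `other_cols` is often a hint that it contains stray text.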


## 4. Selecting Data

- Column selection: df["col"], df[["col1","col2"]]

- Row selection: df.loc[rows, cols], df.iloc[rows, cols]

In [65]:
df1["Name"]            # Single column
df1[["Name", "Age"]]   # Multiple columns

df1.loc[0]             # Row whose index label is 0
df1.iloc[0]            # First row (by integer position)
df1.loc[0, "Age"]      # Specific cell


np.int64(21)
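
With the default RangeIndex, `loc[0]` and `iloc[0]` return the same row, so the difference is easy to miss. A small sketch, assuming a frame whose index labels are deliberately not 0, 1, 2 (the labels 10, 20, 30 are made up):

```python
import pandas as pd

# Index labels deliberately differ from positions 0, 1, 2
df = pd.DataFrame(
    {"Name": ["A", "B", "C"], "Age": [21, 22, 23]},
    index=[10, 20, 30],
)

by_label = df.loc[10, "Age"]      # label 10, column "Age"
by_position = df.iloc[0, 1]       # first row, second column

# Slices differ too: loc includes the end label, iloc excludes the end position
loc_slice = df.loc[10:20]         # rows labelled 10 and 20
iloc_slice = df.iloc[0:1]         # only the first row
```

Here `df.loc[0]` would raise a `KeyError`, since no row carries the label 0.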

## 5. Filtering & Boolean Indexing

In [61]:
df1[df1["Age"] > 21]              # filter
df1[(df1["Age"] > 21) & (df1["Score"] > 85)]  # multiple conditions


  Name  Age  Score
1    B   22     90
2    C   23     95
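
Beyond `&`, boolean masks can be combined with `|` (or) and `~` (not), and a few helpers cover common cases. A minimal sketch on the same toy data (variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Age": [21, 22, 23],
    "Score": [85, 90, 95],
})

either = df[(df["Age"] < 22) | (df["Score"] > 90)]   # OR: rows A and C
not_old = df[~(df["Age"] > 21)]                      # NOT: row A only
in_set = df[df["Name"].isin(["A", "B"])]             # membership test
same_as_and = df.query("Age > 21 and Score > 85")    # string form of an AND filter
```

The parentheses around each condition matter: `&`, `|`, and `~` bind more tightly than comparisons like `>`.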


## 6. Adding / Modifying Columns

In [62]:
df1["Pass"] = df1["Score"] > 80
df1["DoubleAge"] = df1["Age"] * 2

## 7. Dropping Rows/Columns

In [63]:
df1.drop("DoubleAge", axis=1, inplace=True)  # drop column
df1.drop(1, axis=0, inplace=True)            # drop row


## 8. Handling Missing Data (Critical for ML 🚨)

In [64]:
df1.isna().sum()                # count missing values per column
df1.fillna(df1.mean(numeric_only=True), inplace=True)  # fill numeric columns with their mean
df1.dropna(inplace=True)        # drop rows that still contain NaN

# 🔑 ML relevance: Missing values break many ML models, so we handle them here.
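
`df1` has no missing values, so the calls above have nothing to do. A minimal sketch of the count, fill, and drop steps on a toy frame that does contain NaN (the data is made up; `filled` and `dropped` are illustrative names):

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value per numeric column (values are made up)
df = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Age": [21.0, np.nan, 23.0],
    "Score": [85.0, 90.0, np.nan],
})

missing_counts = df.isna().sum()

# Restrict the mean to numeric columns so the text column is left alone
filled = df.fillna(df.mean(numeric_only=True))

# Alternative: drop any row that still contains a NaN
dropped = df.dropna()
```

Calling `df.mean()` without `numeric_only=True` tries to average the text column too, which recent pandas versions reject with a `TypeError`.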

## 9. Aggregations & GroupBy (Split → Apply → Combine)

In [None]:
df1.groupby("Name")["Score"].mean()
df1.groupby("Name").agg({"Score": ["mean","max"], "Age": "min"})

# 🔑 Used for summarizing datasets before training.
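
Grouping by `Name` in `df1` yields one row per group, so the split → apply → combine idea is hard to see. A sketch with a hypothetical `Team` column that repeats (data and column names are made up):

```python
import pandas as pd

# Hypothetical data: two teams with two members each
df = pd.DataFrame({
    "Team": ["X", "X", "Y", "Y"],
    "Age": [21, 22, 23, 24],
    "Score": [85, 90, 95, 70],
})

# Split by Team, apply mean to Score, combine into one Series
mean_score = df.groupby("Team")["Score"].mean()

# Named aggregation gives flat, readable output column names
summary = df.groupby("Team").agg(
    score_mean=("Score", "mean"),
    score_max=("Score", "max"),
    age_min=("Age", "min"),
)
```

The dict form used above, `agg({"Score": ["mean", "max"]})`, returns the same numbers but with a two-level column index, which is usually less convenient for later steps.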

## 10. Sorting & Ranking

In [None]:
df1.sort_values(by="Age")                     # ascending
df1.sort_values(by="Score", ascending=False)  # descending
df1.sort_index()                              # sort by index labels
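
The heading mentions ranking, which the cell does not show. One possible sketch using `Series.rank`, on made-up scores that include a tie (the `Rank` column is an illustrative addition):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Score": [85, 95, 90, 95],
})

# Rank 1 = highest score; tied rows share the lowest rank of the tie ("min")
df["Rank"] = df["Score"].rank(ascending=False, method="min").astype(int)

# Sort by rank, breaking ties alphabetically by name
ranked = df.sort_values(by=["Rank", "Name"])
```

Other `method` values such as `"average"` (the default) or `"dense"` handle ties differently, so the right choice depends on how the ranks will be used.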


## 11. Merge, Join, Concat (SQL-style)

In [None]:
df_a = pd.DataFrame({"ID": [1,2], "Name": ["A","B"]})
df_b = pd.DataFrame({"ID": [1,2], "Score": [85,90]})

pd.merge(df_a, df_b, on="ID", how="inner")   # SQL-style joins
pd.concat([df_a, df_b], axis=0)              # stacking rows; columns missing from one frame become NaN
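
When the two frames share every `ID`, all join types return the same rows. A sketch with partially overlapping IDs shows how `how` changes the result (`left`, `right`, and their values are made up for illustration):

```python
import pandas as pd

left = pd.DataFrame({"ID": [1, 2, 3], "Name": ["A", "B", "C"]})
right = pd.DataFrame({"ID": [2, 3, 4], "Score": [90, 95, 70]})

# Only IDs present in both frames: 2 and 3
inner = pd.merge(left, right, on="ID", how="inner")

# Every row of `left`; Score is NaN where `right` has no match (ID 1)
left_join = pd.merge(left, right, on="ID", how="left")

# Union of IDs; indicator=True adds a "_merge" column naming each row's source
outer = pd.merge(left, right, on="ID", how="outer", indicator=True)

# concat stacks frames; ignore_index=True renumbers the rows 0..n-1
stacked = pd.concat([left, left], axis=0, ignore_index=True)
```

The `_merge` column is handy for sanity checks, for example counting how many rows failed to find a match.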

## 11. Apply, Map, Applymap (Custom Ops)

In [None]:
# Apply function to column
df1["Age"].apply(lambda x: x+1)

# Apply element-wise to Series
df1["Name"].map(str.upper)

# Apply element-wise to the entire DataFrame
# (applymap is deprecated since pandas 2.1; newer versions use df1.map instead)
df1.applymap(lambda x: str(x).upper())
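
Two further patterns come up often: `map` with a dict as a lookup table, and `apply` with `axis=1` to work on whole rows. A minimal sketch on the toy data (the `Result` and `Score_per_year` columns and the `grade_by_name` dict are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Age": [21, 22, 23],
    "Score": [85, 90, 95],
})

# Series.map with a dict: look up each value; unmatched keys become NaN
grade_by_name = {"A": "pass", "B": "pass"}   # "C" intentionally missing
df["Result"] = df["Name"].map(grade_by_name)

# DataFrame.apply with axis=1: the function receives one row at a time
df["Score_per_year"] = df.apply(lambda row: row["Score"] / row["Age"], axis=1)

# The same result, vectorized; usually faster than a row-wise apply
vectorized = df["Score"] / df["Age"]
```

Row-wise `apply` runs a Python function per row, so on large datasets the vectorized form is generally the better first choice.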

## 12. String Operations

In [None]:
df1["Name"].str.lower()
df1["Name"].str.contains("a")   # case-sensitive: "A" does not match "a"
# 🔑 Useful for cleaning messy text data.
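
A rough sketch of what cleaning messy text can look like, assuming a made-up Series with stray whitespace, mixed case, and one missing entry:

```python
import pandas as pd

# Made-up messy values: stray whitespace, mixed case, one missing entry
names = pd.Series(["  alice ", "BOB", "Charlie", None], name="Name")

# Chain .str methods: trim whitespace, then normalize capitalization
cleaned = names.str.strip().str.title()

# contains is case-sensitive by default; case=False ignores case,
# and na=False treats missing entries as "no match"
has_li = names.str.contains("li", case=False, na=False)
```

The `.str` methods skip missing entries rather than raising, so `cleaned` still holds a missing value in the last position.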

## 13. Datetime Operations

In [None]:
df_dates = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-01", "2021-06-01"])
})
df_dates["year"] = df_dates["date"].dt.year

# 🔑 Essential for time-series ML.
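
The `.dt` accessor exposes more than the year. A short sketch of a few calendar features often derived for time-series work, using the same two dates (the new column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-01", "2021-06-01"])
})

df["month"] = df["date"].dt.month
df["dayofweek"] = df["date"].dt.dayofweek      # Monday=0 ... Sunday=6
df["is_weekend"] = df["dayofweek"] >= 5

# Days elapsed since the earliest date in the column
df["days_since_start"] = (df["date"] - df["date"].min()).dt.days
```

Subtracting two datetime columns gives a timedelta column, which has its own `.dt` attributes such as `.days`.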

## 14. Handling Categorical & Text Data

In [None]:
# Assumes df is a DataFrame with "Department" and "Name" columns
df["Department"].unique()       # unique categories
df["Department"].value_counts() # frequency
df["Department"].astype("category")

# String ops
df["Name"].str.upper()
df["Name"].str.contains("li")
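
Most ML models need numbers rather than labels, so categorical columns are usually encoded. A minimal sketch of two common encodings, assuming a hypothetical `Department` column with made-up values:

```python
import pandas as pd

# Hypothetical Department column (values are made up)
df = pd.DataFrame({
    "Name": ["Ali", "Bo", "Cy"],
    "Department": ["HR", "IT", "IT"],
})

counts = df["Department"].value_counts()

# Integer codes: categories are sorted, so HR -> 0, IT -> 1
dept = df["Department"].astype("category")
codes = dept.cat.codes

# One-hot encoding: one indicator column per category
one_hot = pd.get_dummies(df["Department"], prefix="Dept")
```

Integer codes imply an ordering that may not exist in the data, so one-hot encoding is often the safer default for unordered categories.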

## 15. Indexing Tricks

In [None]:
# Assumes df has "Name" and "Salary" columns
df.set_index("Name", inplace=True)
df.reset_index(inplace=True)
df.rename(columns={"Salary":"Income"}, inplace=True)
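
A self-contained sketch of the same three operations without `inplace=True`, so each step returns a new frame. The data is a hypothetical frame with the `Name` and `Salary` columns the cell above expects:

```python
import pandas as pd

# Hypothetical frame with made-up values
df = pd.DataFrame({"Name": ["Ali", "Bo"], "Salary": [50000, 60000]})

indexed = df.set_index("Name")            # "Name" becomes the row labels
bo_salary = indexed.loc["Bo", "Salary"]   # label-based lookup

restored = indexed.reset_index()          # "Name" is a regular column again
renamed = restored.rename(columns={"Salary": "Income"})
```

Returning new frames keeps the original `df` untouched, which makes it easier to re-run cells in any order.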

## 16. Importing & Exporting Data

In [None]:
# CSV
df = pd.read_csv("data.csv")    # keep the result; the call alone discards it
df.to_csv("output.csv", index=False)

# Excel
pd.read_excel("data.xlsx")
df.to_excel("output.xlsx", index=False)

# JSON
pd.read_json("data.json")
df.to_json("output.json", orient="records")
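
The file names above are placeholders, so those calls need real files to run. A sketch of the same round trip entirely in memory, assuming `io.StringIO` as a stand-in for a file on disk:

```python
import io
import pandas as pd

df = pd.DataFrame({"Name": ["A", "B"], "Score": [85, 90]})

# With no path, to_csv returns the CSV text instead of writing a file
csv_text = df.to_csv(index=False)
from_csv = pd.read_csv(io.StringIO(csv_text))

# orient="records" produces a JSON list with one object per row
json_text = df.to_json(orient="records")
from_json = pd.read_json(io.StringIO(json_text))
```

Leaving out `index=False` would write the row index as an extra unnamed column, which then comes back as `Unnamed: 0` when the CSV is read again.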
