# `DataFrame` – Two-dimensional labeled table

In [19]:
import pandas as pd
import numpy as np

A **DataFrame** is a **2-dimensional labeled data structure** with:

* **Rows**: indexed with numbers or labels
* **Columns**: each with a label (like a named variable)

Think of it as an **Excel spreadsheet**, an **SQL table**, or a collection of **Series** objects sharing the same index.

## Syntax

```python
import pandas as pd

df = pd.DataFrame(data, index=None, columns=None, dtype=None)
```

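* `data`: Input data: a dictionary, list of lists, NumPy array, Series, another DataFrame, etc.
* `index`: Row labels; defaults to `0, 1, 2, ...` (a `RangeIndex`) when the data carries no index of its own
* `columns`: Column labels; defaults to `0, 1, 2, ...` when the data carries no column labels
* `dtype`: A single data type to force on all columns; inferred from the data when `None`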
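
As a small illustration of how these parameters work together, the sketch below passes all four at once. The labels and values are made up for the example.

```python
import pandas as pd

# Hypothetical values, chosen only to illustrate the four parameters.
df_params = pd.DataFrame(
    [[1, 2], [3, 4]],      # data: a list of rows
    index=['r1', 'r2'],    # index: row labels
    columns=['A', 'B'],    # columns: column labels
    dtype='float64',       # dtype: one type applied to every column
)
print(df_params)
print(df_params.dtypes)
```

Because `dtype` accepts only a single type, per-column types are usually set afterwards, for example with `astype`.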


## Common Ways to Create a DataFrame

### 1. From a Dictionary of Lists

In [14]:
data = {
    'Name': ['Mahesh', 'Rajesh', 'Ravi'],
    'Age': [20, 35, 25],
    'Salary': [40, 55, 50]
}

df = pd.DataFrame(data)
print(df)

     Name  Age  Salary
0  Mahesh   20      40
1  Rajesh   35      55
2    Ravi   25      50


✅ **Most common** method, especially for structured data. Each key becomes a column label, and all lists must have the same length.
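
One thing to watch with this method is list length. The sketch below, using made-up data, shows what happens when the lists do not line up.

```python
import pandas as pd

# Lists of unequal length cannot be arranged into rows, so pandas raises ValueError.
bad_data = {'Name': ['Mahesh', 'Rajesh'], 'Age': [20]}

try:
    pd.DataFrame(bad_data)
    error_raised = False
except ValueError as exc:
    error_raised = True
    print('ValueError:', exc)
```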

### 2. From a List of Dictionaries

In [15]:
data = [
    {'Name': 'Mahesh', 'Age': 25},
    {'Name': 'Rajesh', 'Age': 45},
    {'Name': 'Ramesh', 'Age': 30},
]

df = pd.DataFrame(data)
print(df)

     Name  Age
0  Mahesh   25
1  Rajesh   45
2  Ramesh   30


In [16]:
data = [
    {'Name': 'Mahesh', 'Age': 25},
    {'Name': 'Rajesh', 'Age': 45},
    {'Age': 30},
]

df = pd.DataFrame(data)
print(df)

     Name  Age
0  Mahesh   25
1  Rajesh   45
2     NaN   30


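✅ Great when reading from JSON or APIs. Each dictionary becomes a row, and keys missing from a record are filled with `NaN`, as in the second example.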
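
A side effect worth knowing: when the missing key belongs to a numeric column, the column is stored as floating point, because `NaN` is a float. A minimal sketch with hypothetical records:

```python
import pandas as pd

records = [
    {'Name': 'Mahesh', 'Age': 25},
    {'Name': 'Rajesh'},              # 'Age' is missing in this record
]
df_records = pd.DataFrame(records)

print(df_records['Age'].dtype)       # float64, since NaN cannot live in an int64 column
print(df_records['Age'].isna())      # marks the row where 'Age' was missing
```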

### 3. From a Dictionary of Series

In [17]:
data = {
    'Marks': pd.Series([85, 90, 95], index=['Mahesh', 'Manju', "Malli"]),
    'Grade': pd.Series(['B+', 'A', 'A+'], index=['Mahesh', 'Manju', "Malli"])
}

df = pd.DataFrame(data)
print(df)

        Marks Grade
Mahesh     85    B+
Manju      90     A
Malli      95    A+


✅ Useful for labeled data or time series. The Series are aligned on their index labels, which become the row labels.
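
The example above uses the same index for both Series. As a sketch of what alignment means when the indexes differ, the hypothetical Series below share only one label.

```python
import pandas as pd

marks = pd.Series([85, 90], index=['Mahesh', 'Manju'])
grade = pd.Series(['A', 'A+'], index=['Manju', 'Malli'])

# Row labels are the union of both indexes; cells with no value become NaN.
df_aligned = pd.DataFrame({'Marks': marks, 'Grade': grade})
print(df_aligned)
```

Values are matched by label, not by position, so `'Manju'` receives mark `90` and grade `'A'` even though the two Series list it at different positions.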

### 4. From a List of Lists / Tuples (with column names)

In [18]:
data = [
    ['Mahesh', 20],
    ['Malli', 25],
    ['Suraj', 30]
]

df = pd.DataFrame(data, columns=['Name', 'Age'])
print(df)

     Name  Age
0  Mahesh   20
1   Malli   25
2   Suraj   30


✅ Often used after loading CSV or tabular data manually.
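
The heading also mentions tuples, which behave the same way as inner lists. A short sketch reusing the data above:

```python
import pandas as pd

# Each tuple is one row; column names are supplied separately.
rows = [('Mahesh', 20), ('Malli', 25), ('Suraj', 30)]
df_tuples = pd.DataFrame(rows, columns=['Name', 'Age'])
print(df_tuples)
```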

### 5. From a NumPy Array

In [20]:
arr = np.array([[10, 20], [30, 40]])

df = pd.DataFrame(arr, columns=['A', 'B'])
print(df)

    A   B
0  10  20
1  30  40


✅ Good for numeric or matrix data.
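
As a further sketch, the array's dtype carries over to the DataFrame, and `to_numpy()` converts back. The array below is an arbitrary example.

```python
import numpy as np
import pandas as pd

arr32 = np.arange(6, dtype='float32').reshape(3, 2)   # 3 rows, 2 columns
df_arr = pd.DataFrame(arr32, columns=['A', 'B'])

print(df_arr.dtypes)          # both columns keep the array's float32 dtype
back = df_arr.to_numpy()      # round trip back to a NumPy array
```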

### 6. From a Series (Single Column)

In [23]:
df = pd.DataFrame(pd.Series([10, 20, 30], index=['a', 'b', 'c']))
print(df)
print(type(df))

    0
a  10
b  20
c  30
<class 'pandas.core.frame.DataFrame'>


✅ Converts a Series into a single-column DataFrame. An unnamed Series gets the column label `0`.
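
To get a meaningful column label instead of `0`, the Series can be given a name. The sketch below (the name `'Value'` is just an example) also shows `Series.to_frame()`, which does the same conversion.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'], name='Value')

df_from_series = pd.DataFrame(s)   # the Series name becomes the column label
df_to_frame = s.to_frame()         # equivalent conversion via the Series method
print(df_from_series)
```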

### 7. From a Scalar (Same value for all rows)

In [26]:
df = pd.DataFrame(10, index=['a', 'b', 'c'], columns=['X', 'Y'])
print(df)

    X   Y
a  10  10
b  10  10
c  10  10


✅ Useful for testing or initializing default values.

### 8. From CSV / Excel / SQL (Real-world)

```python
# From CSV
df = pd.read_csv('data.csv')

# From Excel
df = pd.read_excel('data.xlsx')

# From SQL
import sqlite3
conn = sqlite3.connect('db.sqlite')
df = pd.read_sql_query("SELECT * FROM customers", conn)
```

✅ Very common in data science projects.
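
The snippet above needs files that may not exist on your machine. As a self-contained sketch, the version below reads CSV text from memory and uses an in-memory SQLite database; the table name and rows are made up.

```python
import io
import sqlite3
import pandas as pd

# CSV: read_csv accepts any file-like object, not only a path.
csv_text = "Name,Age\nMahesh,20\nRajesh,35\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

# SQL: write the frame to an in-memory SQLite table, then query it back.
conn = sqlite3.connect(':memory:')
df_csv.to_sql('customers', conn, index=False)
df_sql = pd.read_sql_query("SELECT * FROM customers WHERE Age > 25", conn)
conn.close()

print(df_sql)
```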

### 9. From Zipped/Compressed/Remote CSVs

```python
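# compression='gzip' forces gzip decoding. The default, compression='infer',
# picks the codec from the file extension (for example '.gz' or '.zip').
df = pd.read_csv("https://example.com/data.csv", compression='gzip')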
```

✅ Works directly with remote URLs and compressed files.
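
Since the URL above is only a placeholder, here is a local sketch of the same idea: a gzipped CSV is written to a temporary directory and read back, relying on the default extension-based inference in both directions.

```python
import os
import tempfile
import pandas as pd

df_out = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'data.csv.gz')
    df_out.to_csv(path, index=False)   # gzip is inferred from the '.gz' suffix
    df_in = pd.read_csv(path)          # compression='infer' is the default

print(df_in)
```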

### 10. Creating an Empty DataFrame

In [None]:
df = pd.DataFrame()
print(df)

Empty DataFrame
Columns: []
Index: []


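✅ Useful as a placeholder. To add rows later, collect them in a list and build the DataFrame once, or use `pd.concat`; `DataFrame.append` was removed in pandas 2.0.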
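
A minimal sketch of adding rows over time, assuming the rows arrive one by one as in a loop (the names and ages are sample values):

```python
import pandas as pd

# Collect rows as dictionaries first, then build the DataFrame in one step.
rows = []
for name, age in [('Mahesh', 20), ('Rajesh', 35)]:
    rows.append({'Name': name, 'Age': age})
df_built = pd.DataFrame(rows)

# To combine with rows that arrive later, concatenate DataFrames.
extra = pd.DataFrame([{'Name': 'Ravi', 'Age': 25}])
df_all = pd.concat([df_built, extra], ignore_index=True)
print(df_all)
```

Building the frame once avoids copying the whole table on every added row.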

### 11. With Custom Index and Columns

In [32]:
data = [[1, 2], [3, 4]]
df = pd.DataFrame(data, index=['row1', 'row2'], columns=['A', 'B'])
print(df)

      A  B
row1  1  2
row2  3  4


✅ Used for labeled datasets.

### 12. Creating a Time-Based DataFrame

In [33]:
dates = pd.date_range(start='2025-06-11', periods=5)
df = pd.DataFrame({'Values': [10, 20, 30, 40, 50]}, index=dates)
print(df)

            Values
2025-06-11      10
2025-06-12      20
2025-06-13      30
2025-06-14      40
2025-06-15      50


✅ Useful in time series forecasting and analysis.
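
One practical benefit of a date index is that rows can be selected by date strings. A short sketch on the same data:

```python
import pandas as pd

dates = pd.date_range(start='2025-06-11', periods=5)
df_ts = pd.DataFrame({'Values': [10, 20, 30, 40, 50]}, index=dates)

# Label slicing on a DatetimeIndex includes both end dates.
window = df_ts.loc['2025-06-12':'2025-06-14']
print(window)
```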

## 🧠 Recap of Possible Data Inputs

| Input Type     | Description                          | Example                        |
| -------------- | ------------------------------------ | ------------------------------ |
| dict of lists  | Most common structured data          | `{'A': [1, 2]}`                |
| list of dicts  | For API or JSON-like data            | `[{'A': 1}, {'A': 2}]`         |
| dict of Series | For labeled data aligned on index    | `{'A': pd.Series(...), ...}`   |
| list of lists  | Matrix-like data                     | `[[1, 2], [3, 4]]`             |
| NumPy array    | For numeric datasets                 | `np.array(...)`                |
| Series         | Single column wrapped into DataFrame | `pd.DataFrame(pd.Series(...))` |
| CSV/Excel/SQL  | Real-world external data             | `pd.read_csv(...)`             |
| Scalar         | Same value in all cells              | `pd.DataFrame(0, ...)`         |
| Time index     | For time-series datasets             | `pd.date_range(...)`           |

## 🔍 Core Concepts of DataFrame

| Concept               | Example                          | Explanation                             |
| --------------------- | -------------------------------- | --------------------------------------- |
| Rows & Columns        | `df.shape`                       | Dimensions of data                      |
| Index (row labels)    | `df.index`                       | Row identifiers                         |
| Columns (labels)      | `df.columns`                     | Column names                            |
| Data types            | `df.dtypes`                      | Type of each column                     |
| Summary statistics    | `df.describe()`                  | Mean, std, min, max, etc.               |
| Column access         | `df['Age']` or `df.Age`          | Access specific column (returns Series) |
| Row access            | `df.loc[0]`, `df.iloc[0]`        | Get rows by index label or position     |
| Boolean filtering     | `df[df['Age'] > 30]`             | Filter rows based on condition          |
| Adding columns        | `df['Tax'] = df['Salary'] * 0.1` | Feature engineering                     |
| Dropping columns      | `df.drop('Age', axis=1)`         | Data cleaning                           |
| Sorting data          | `df.sort_values('Salary')`       | Sorting rows                            |
| Handling missing data | `df.dropna()`, `df.fillna(0)`    | Drop or fill missing values             |
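
Several rows of the table can be tried on the Name/Age/Salary data from the first example. This is only a sketch; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Mahesh', 'Rajesh', 'Ravi'],
    'Age': [20, 35, 25],
    'Salary': [40, 55, 50],
})

print(df.shape)                         # (3, 3): three rows, three columns
older = df[df['Age'] > 30]              # boolean filtering keeps matching rows
df['Tax'] = df['Salary'] * 0.1          # adds a new column to df in place
no_age = df.drop('Age', axis=1)         # returns a new frame; df itself keeps 'Age'
by_salary = df.sort_values('Salary')    # ascending order by default
```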

---

### ✅ Indexing Types

| Access Method | Description                | Example                    |
| ------------- | -------------------------- | -------------------------- |
| `df.loc[]`    | Access by label            | `df.loc[0]`, `df.loc['a']` |
| `df.iloc[]`   | Access by integer position | `df.iloc[0:2]`             |
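
The difference between the two is easiest to see when the row labels are integers that do not match the row positions. A minimal sketch with made-up data:

```python
import pandas as pd

# Row labels are 2, 1, 0, deliberately the reverse of the positions 0, 1, 2.
df_idx = pd.DataFrame({'X': [10, 20, 30]}, index=[2, 1, 0])

by_label = df_idx.loc[0, 'X']        # the row whose label is 0 (the last row)
by_position = df_idx.iloc[0]['X']    # the row at position 0 (the first row)

label_slice = df_idx.loc[2:1]        # label slice: both endpoints are included
position_slice = df_idx.iloc[0:1]    # position slice: the end is excluded
```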


**All DataFrame operations will be covered in later sections.**
