# DataFrame Creation in Pandas

## What is a DataFrame?
A **DataFrame** is a 2-dimensional labeled data structure with:
- Columns of potentially different types
- Labeled axes (rows and columns)
- Size-mutable (rows and columns can be added or removed)
- Tabular structure (like a spreadsheet or SQL table)

## Key Characteristics:
- 2D structure (rows × columns)
- Columns can have different data types
- Labeled rows (index) and columns
- Powerful data manipulation capabilities

In [1]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

## 1. Creating DataFrame from Dictionary of Lists

### Structure:
- Keys become column names
- Values become column data (as lists)
- All lists must have the same length

In [3]:
# Example 1: Basic DataFrame from dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Age': [25, 30, 35, 28],
    'City': ['New York', 'London', 'Tokyo', 'Paris'],
    'Salary': [50000, 60000, 75000, 55000]
}

df1 = pd.DataFrame(data)
print("DataFrame from dictionary of lists:")
print(df1)
print(f"\nShape: {df1.shape}")
print(f"Columns: {df1.columns.tolist()}")
print()

# Example 2: With custom index
df2 = pd.DataFrame(data, index=['A', 'B', 'C', 'D'])
print("DataFrame with custom index:")
print(df2)

DataFrame from dictionary of lists:
      Name  Age      City  Salary
0    Alice   25  New York   50000
1      Bob   30    London   60000
2  Charlie   35     Tokyo   75000
3    Diana   28     Paris   55000

Shape: (4, 4)
Columns: ['Name', 'Age', 'City', 'Salary']

DataFrame with custom index:
      Name  Age      City  Salary
A    Alice   25  New York   50000
B      Bob   30    London   60000
C  Charlie   35     Tokyo   75000
D    Diana   28     Paris   55000
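
The same-length rule can be checked directly. The sketch below uses made-up data (`uneven` is a hypothetical name, not part of the examples above) to show that lists of different lengths raise a `ValueError`, and that wrapping each list in a `pd.Series` is one possible way to pad the shorter column with NaN instead.

```python
import pandas as pd

# Hypothetical data: 'Age' has one value fewer than 'Name'
uneven = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30]}

try:
    pd.DataFrame(uneven)
    raised = False
except ValueError as exc:
    raised = True
    print(f"ValueError: {exc}")

# Wrapping each list in a Series aligns on the index and pads with NaN
padded = pd.DataFrame({key: pd.Series(values) for key, values in uneven.items()})
print(padded)
```

Note that padding turns `Age` into a float column, because NaN is a floating-point value.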


## 2. Creating DataFrame from List of Dictionaries

### Structure:
- Each dictionary represents a row
- Dictionary keys become column names
- Keys missing from a dictionary become NaN in that row

In [4]:
# Example: List of dictionaries (each dict = one row)
employees = [
    {'Name': 'Alice', 'Department': 'HR', 'Experience': 3},
    {'Name': 'Bob', 'Department': 'Engineering', 'Experience': 5, 'Manager': 'Eve'},
    {'Name': 'Charlie', 'Department': 'Marketing', 'Experience': 2},
    {'Name': 'Diana', 'Department': 'Engineering', 'Experience': 7, 'Manager': 'Frank'}
]

df3 = pd.DataFrame(employees)
print("DataFrame from list of dictionaries:")
print(df3)
print(f"\nNote: Missing values become NaN (see 'Manager' column)")
print(f"Data types:\n{df3.dtypes}")

DataFrame from list of dictionaries:
      Name   Department  Experience Manager
0    Alice           HR           3     NaN
1      Bob  Engineering           5     Eve
2  Charlie    Marketing           2     NaN
3    Diana  Engineering           7   Frank

Note: Missing values become NaN (see 'Manager' column)
Data types:
Name          object
Department    object
Experience     int64
Manager       object
dtype: object
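
In the output above, `Manager` holds strings, so the NaN values do not change its dtype. For a numeric column the effect is more visible. A minimal sketch with hypothetical records, showing that a missing integer turns the column into `float64`, plus one possible way to restore an integer dtype:

```python
import pandas as pd

# Hypothetical records: the second row has no 'Experience' key
records = [
    {'Name': 'Alice', 'Experience': 3},
    {'Name': 'Bob'},
]
df_missing = pd.DataFrame(records)
print(df_missing.dtypes)  # 'Experience' is float64 because of the NaN

# Fill the gap, then convert back to an integer dtype
df_filled = df_missing.fillna({'Experience': 0}).astype({'Experience': 'int64'})
print(df_filled)
```

Filling with 0 is only an assumption for illustration; the right replacement depends on the data.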


## 3. Creating DataFrame from List of Lists

### Structure:
- Each inner list represents a row
- Must specify column names separately
- All rows should have the same number of elements, matching the number of column names

In [5]:
# Example: List of lists (tabular data)
sales_data = [
    [101, 'Laptop', 1200, 15],
    [102, 'Mouse', 25, 120],
    [103, 'Keyboard', 45, 85],
    [104, 'Monitor', 350, 20],
    [105, 'Tablet', 600, 35]
]

columns = ['Product_ID', 'Product_Name', 'Price', 'Quantity']

df4 = pd.DataFrame(sales_data, columns=columns)
print("DataFrame from list of lists:")
print(df4)
print()

# With custom index
df5 = pd.DataFrame(sales_data, columns=columns, index=[f'S{i}' for i in range(1, 6)])
print("With custom string index:")
print(df5)

DataFrame from list of lists:
   Product_ID Product_Name  Price  Quantity
0         101       Laptop   1200        15
1         102        Mouse     25       120
2         103     Keyboard     45        85
3         104      Monitor    350        20
4         105       Tablet    600        35

With custom string index:
    Product_ID Product_Name  Price  Quantity
S1         101       Laptop   1200        15
S2         102        Mouse     25       120
S3         103     Keyboard     45        85
S4         104      Monitor    350        20
S5         105       Tablet    600        35
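
Rows do not have to be lists. As a small sketch with hypothetical data, the same constructor call accepts a list of tuples, and `set_index()` can promote an existing column to the row labels instead of passing a separate `index` list.

```python
import pandas as pd

# Hypothetical rows stored as tuples, as many database cursors return them
rows = [(101, 'Laptop', 1200), (102, 'Mouse', 25)]
df_rows = pd.DataFrame(rows, columns=['Product_ID', 'Product_Name', 'Price'])

# Use an existing column as the row labels
df_indexed = df_rows.set_index('Product_ID')
print(df_indexed)
```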


## 4. Creating DataFrame from NumPy Arrays

### Advantages:
- Efficient for numerical data
- Direct integration with NumPy operations
- Good for large datasets

In [7]:
# Example 1: 2D NumPy array
np_array = np.array([
    [1, 2.5, 'A'],
    [2, 3.7, 'B'],
    [3, 4.2, 'C'],
    [4, 5.1, 'D']
])

df6 = pd.DataFrame(np_array, columns=['ID', 'Value', 'Category'])
print("DataFrame from 2D NumPy array:")
print(df6)
print(f"\nData types: {df6.dtypes.tolist()}")
print("Note: Mixed types in array result in object dtype")
print()

# Example 2: Dictionary of NumPy arrays
np_data = {
    'X': np.random.randn(5),
    'Y': np.random.randint(1, 100, 5),
    'Z': np.linspace(0, 1, 5)
}

df7 = pd.DataFrame(np_data, index=[f'Obs_{i}' for i in range(1, 6)])
print("DataFrame from dictionary of NumPy arrays:")
print(df7)

DataFrame from 2D NumPy array:
  ID Value Category
0  1   2.5        A
1  2   3.7        B
2  3   4.2        C
3  4   5.1        D

Data types: [dtype('O'), dtype('O'), dtype('O')]
Note: Mixed types in array result in object dtype

DataFrame from dictionary of NumPy arrays:
              X   Y     Z
Obs_1 -1.244063  65  0.00
Obs_2 -1.192360  59  0.25
Obs_3  0.261215  19  0.50
Obs_4  0.860722  84  0.75
Obs_5  0.560904   7  1.00
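
One detail behind the first example: a NumPy array has a single dtype, so mixing numbers and strings makes NumPy store every element as a string before pandas sees it. The sketch below (variable names are hypothetical) checks this and shows one way to recover numeric columns with `astype()`.

```python
import numpy as np
import pandas as pd

mixed = np.array([[1, 2.5, 'A'], [2, 3.7, 'B']])
print(mixed.dtype)  # a Unicode string dtype: the numbers were converted to text

df_mixed = pd.DataFrame(mixed, columns=['ID', 'Value', 'Category'])
print(type(df_mixed.loc[0, 'ID']))  # a string, not an integer

# Convert the numeric columns explicitly
df_typed = df_mixed.astype({'ID': 'int64', 'Value': 'float64'})
print(df_typed.dtypes)
```

When the column types differ, building the DataFrame from a dictionary of arrays (as in Example 2) avoids the conversion in the first place.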


## 5. Creating DataFrame from Series

### Methods:
- Dictionary of Series (each Series becomes a column)
- List of Series (each Series becomes a row)
- Single Series (becomes single column DataFrame)

In [8]:
# Create individual Series
names = pd.Series(['Alice', 'Bob', 'Charlie', 'Diana'], name='Name')
ages = pd.Series([25, 30, 35, 28], name='Age')
departments = pd.Series(['HR', 'Engineering', 'Marketing', 'Engineering'], name='Department')

# Method 1: Dictionary of Series
df8 = pd.DataFrame({'Name': names, 'Age': ages, 'Department': departments})
print("DataFrame from dictionary of Series:")
print(df8)
print()

# Method 2: List of Series (each Series becomes a row)
row1 = pd.Series(['Alice', 25, 'HR'], index=['Name', 'Age', 'Dept'])
row2 = pd.Series(['Bob', 30, 'Engineering'], index=['Name', 'Age', 'Dept'])
row3 = pd.Series(['Charlie', 35, 'Marketing'], index=['Name', 'Age', 'Dept'])

df9 = pd.DataFrame([row1, row2, row3])
print("DataFrame from list of Series (rows):")
print(df9)

DataFrame from dictionary of Series:
      Name  Age   Department
0    Alice   25           HR
1      Bob   30  Engineering
2  Charlie   35    Marketing
3    Diana   28  Engineering

DataFrame from list of Series (rows):
      Name  Age         Dept
0    Alice   25           HR
1      Bob   30  Engineering
2  Charlie   35    Marketing
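
The third method, a single Series, is not shown above. A minimal sketch using `Series.to_frame()`, followed by a dictionary of Series whose indexes differ (hypothetical data) to illustrate that columns are aligned by label:

```python
import pandas as pd

# Single Series -> one-column DataFrame; the Series name becomes the column name
ages = pd.Series([25, 30, 35], name='Age')
single = ages.to_frame()
print(single)

# Series with different indexes are aligned by label; gaps become NaN
a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10, 20], index=['y', 'z'])
aligned = pd.DataFrame({'a': a, 'b': b})
print(aligned)
```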


## 6. Creating Empty and Special DataFrames

### Useful for:
- Initializing structure before filling with data
- Creating placeholder DataFrames
- Testing and prototyping

In [10]:
# Empty DataFrame
empty_df = pd.DataFrame()
print("Empty DataFrame:")
print(empty_df)
print(f"Shape: {empty_df.shape}")
print(f"Columns: {empty_df.columns.tolist()}")
print()

# DataFrame with columns but no data
columns_only = pd.DataFrame(columns=['Name', 'Age', 'City'])
print("DataFrame with only columns:")
print(columns_only)
print(f"Shape: {columns_only.shape}")
print()

# DataFrame with zeros
zeros_df = pd.DataFrame(np.zeros((3, 4)), columns=['A', 'B', 'C', 'D'])
print("DataFrame of zeros:")
print(zeros_df)
print()

# DataFrame with ones
ones_df = pd.DataFrame(np.ones((2, 3)), columns=['Col1', 'Col2', 'Col3'])
print("DataFrame of ones:")
print(ones_df)
print()

# DataFrame with random values
random_df = pd.DataFrame(np.random.rand(4, 3), columns=['Random1', 'Random2', 'Random3'])
print("DataFrame with random values (0-1):")
print(random_df)

Empty DataFrame:
Empty DataFrame
Columns: []
Index: []
Shape: (0, 0)
Columns: []

DataFrame with only columns:
Empty DataFrame
Columns: [Name, Age, City]
Index: []
Shape: (0, 3)

DataFrame of zeros:
     A    B    C    D
0  0.0  0.0  0.0  0.0
1  0.0  0.0  0.0  0.0
2  0.0  0.0  0.0  0.0

DataFrame of ones:
   Col1  Col2  Col3
0   1.0   1.0   1.0
1   1.0   1.0   1.0

DataFrame with random values (0-1):
    Random1   Random2   Random3
0  0.795986  0.773293  0.235445
1  0.092216  0.917553  0.756958
2  0.163788  0.544988  0.099747
3  0.920535  0.382086  0.968367
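
Two variations that may be useful when prototyping, sketched under the assumption that the column types are known in advance: an empty DataFrame with explicit dtypes, and random data from a seeded generator so the values are reproducible. The column names and the seed are arbitrary choices.

```python
import numpy as np
import pandas as pd

# Empty DataFrame whose columns already carry the intended dtypes
typed_empty = pd.DataFrame({
    'Name': pd.Series(dtype='object'),
    'Age': pd.Series(dtype='int64'),
})
print(typed_empty.dtypes)

# A seeded generator produces the same random values on every run
rng = np.random.default_rng(seed=42)
random_a = pd.DataFrame(rng.random((4, 3)), columns=['R1', 'R2', 'R3'])
random_b = pd.DataFrame(np.random.default_rng(seed=42).random((4, 3)),
                        columns=['R1', 'R2', 'R3'])
print(random_a.equals(random_b))
```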


## 7. Creating DataFrame with Range Data

### Using pd.date_range():
- Creates time series data
- Useful for temporal analysis
- Flexible frequency options

In [11]:
# Create date range
dates = pd.date_range(start='2023-01-01', periods=10, freq='D')
print("Date Range:")
print(dates)
print()

# Create DataFrame with date index
df10 = pd.DataFrame({
    'Temperature': np.random.randint(60, 90, 10),
    'Humidity': np.random.randint(30, 80, 10),
    'Pressure': np.random.normal(1013, 5, 10)
}, index=dates)

print("Time Series DataFrame:")
print(df10)
print(f"\nIndex type: {type(df10.index)}")
print(f"Index name: {df10.index.name}")
print()

# Add column with day names
df10['Day'] = df10.index.day_name()
print("With day names:")
print(df10.head())

Date Range:
DatetimeIndex(['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04',
               '2023-01-05', '2023-01-06', '2023-01-07', '2023-01-08',
               '2023-01-09', '2023-01-10'],
              dtype='datetime64[ns]', freq='D')

Time Series DataFrame:
            Temperature  Humidity     Pressure
2023-01-01           78        63  1009.732596
2023-01-02           87        77  1011.258144
2023-01-03           71        64  1012.061780
2023-01-04           81        72  1017.084595
2023-01-05           76        49  1011.899724
2023-01-06           71        36  1013.134371
2023-01-07           75        74  1016.521574
2023-01-08           86        42  1019.404313
2023-01-09           62        43  1017.036320
2023-01-10           85        52  1014.027753

Index type: <class 'pandas.core.indexes.datetimes.DatetimeIndex'>
Index name: None

With day names:
            Temperature  Humidity     Pressure        Day
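2023-01-01           78        63  1009.732596     Sunday
2023-01-02           87        77  1011.258144     Monday
2023-01-03           71        64  1012.061780    Tuesday
2023-01-04           81        72  1017.084595  Wednesday
2023-01-05           76        49  1011.899724   Thursday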
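
The `freq` argument accepts more than daily steps. As a short sketch, the aliases `'B'` (business days) and `'W'` (weekly, anchored on Sunday by default) produce the following ranges for January 2023:

```python
import pandas as pd

# Business days only: 2023-01-01 is a Sunday, so the range starts on Monday
business_days = pd.date_range(start='2023-01-01', periods=5, freq='B')
print(business_days)

# Weekly dates between two endpoints
weekly = pd.date_range(start='2023-01-01', end='2023-01-31', freq='W')
print(weekly)
```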

## 8. Creating DataFrame from External Files

### Common File Formats:
- CSV (pd.read_csv())
- Excel (pd.read_excel())
- JSON (pd.read_json())
- SQL (pd.read_sql())

In [12]:
# Note: These examples show syntax. Files need to exist in your environment.

# Example 1: CSV file
# df_csv = pd.read_csv('data.csv')
print("CSV Example: pd.read_csv('filename.csv')")

# Example 2: Excel file
# df_excel = pd.read_excel('data.xlsx', sheet_name='Sheet1')
print("Excel Example: pd.read_excel('filename.xlsx', sheet_name='Sheet1')")

# Example 3: JSON file
# df_json = pd.read_json('data.json')
print("JSON Example: pd.read_json('filename.json')")

# Example 4: Creating sample CSV data for demonstration
sample_csv_data = """Name,Age,City,Salary
Alice,25,New York,50000
Bob,30,London,60000
Charlie,35,Tokyo,75000
Diana,28,Paris,55000"""

# Read from string (similar to reading file)
import io
df11 = pd.read_csv(io.StringIO(sample_csv_data))
print("\nSample DataFrame from CSV string:")
print(df11)

CSV Example: pd.read_csv('filename.csv')
Excel Example: pd.read_excel('filename.xlsx', sheet_name='Sheet1')
JSON Example: pd.read_json('filename.json')

Sample DataFrame from CSV string:
      Name  Age      City  Salary
0    Alice   25  New York   50000
1      Bob   30    London   60000
2  Charlie   35     Tokyo   75000
3    Diana   28     Paris   55000
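
To try a real file without preparing one by hand, a round trip through a temporary directory works. This is a sketch: the file name `people.csv` and the data are hypothetical, and `usecols` and `dtype` are two of the many `read_csv()` options.

```python
import os
import tempfile

import pandas as pd

people = pd.DataFrame({
    'Name': ['Alice', 'Bob'],
    'Age': [25, 30],
    'Salary': [50000, 60000],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'people.csv')  # hypothetical file name
    people.to_csv(csv_path, index=False)            # index=False skips the row labels
    # Read back only two columns and force Salary to float
    loaded = pd.read_csv(csv_path, usecols=['Name', 'Salary'],
                         dtype={'Salary': 'float64'})

print(loaded)
```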


## 9. DataFrame Attributes for Inspection

### Key Attributes and Methods:
- .shape: (rows, columns)
- .columns: column names
- .index: row labels
- .dtypes: data types per column
- .info(): concise summary
- .describe(): statistical summary

In [13]:
# Create a sample DataFrame
sample_df = pd.DataFrame({
    'Product': ['A', 'B', 'C', 'D', 'E'],
    'Price': [100, 200, 150, 300, 250],
    'Quantity': [10, 5, 8, 3, 6],
    'In_Stock': [True, True, False, True, True],
    'Category': ['Electronics', 'Furniture', 'Electronics', 'Office', 'Furniture']
})

print("Sample DataFrame:")
print(sample_df)
print()

# Basic attributes
print("--- BASIC ATTRIBUTES ---")
print(f"Shape (rows, columns): {sample_df.shape}")
print(f"Number of rows: {sample_df.shape[0]}")
print(f"Number of columns: {sample_df.shape[1]}")
print(f"Columns: {sample_df.columns.tolist()}")
print(f"Index: {sample_df.index.tolist()}")
print(f"\nData types:\n{sample_df.dtypes}")
print(f"\nColumn types:\n{sample_df.dtypes.value_counts()}")

# Methods for inspection
print("\n--- INSPECTION METHODS ---")
print("DataFrame.info():")
sample_df.info()
print("\nDataFrame.describe() (numerical columns):")
print(sample_df.describe())
print("\nDataFrame.describe(include='all'):")
print(sample_df.describe(include='all'))

Sample DataFrame:
  Product  Price  Quantity  In_Stock     Category
0       A    100        10      True  Electronics
1       B    200         5      True    Furniture
2       C    150         8     False  Electronics
3       D    300         3      True       Office
4       E    250         6      True    Furniture

--- BASIC ATTRIBUTES ---
Shape (rows, columns): (5, 5)
Number of rows: 5
Number of columns: 5
Columns: ['Product', 'Price', 'Quantity', 'In_Stock', 'Category']
Index: [0, 1, 2, 3, 4]

Data types:
Product     object
Price        int64
Quantity     int64
In_Stock      bool
Category    object
dtype: object

Column types:
object    2
int64     2
bool      1
Name: count, dtype: int64

--- INSPECTION METHODS ---
DataFrame.info():
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Product   5 non-null      object
 1   Price     5 non-null      int64 
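
A few further inspection calls complement the ones above. The sketch below uses a small hypothetical DataFrame with one missing value to count NaN values per column and to pick out the numeric columns.

```python
import pandas as pd

# Hypothetical data with one missing price
inventory = pd.DataFrame({
    'Product': ['A', 'B', 'C'],
    'Price': [100, 200, None],
    'In_Stock': [True, True, False],
})

# Number of missing values in each column
missing_per_column = inventory.isna().sum()
print(missing_per_column)

# Names of the numeric columns (boolean columns are not counted as numeric)
numeric_columns = inventory.select_dtypes(include='number').columns.tolist()
print(numeric_columns)
```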
 

## 10. Setting DataFrame Properties

### Customizing:
- Column names (.columns)
- Index (.index)
- Index name (.index.name)
- Name of the column axis (.columns.name)

In [14]:
# Create initial DataFrame
df12 = pd.DataFrame({
    'a': [1, 2, 3, 4],
    'b': [5, 6, 7, 8],
    'c': [9, 10, 11, 12]
})

print("Original DataFrame:")
print(df12)
print()

# Change column names
df12.columns = ['Col1', 'Col2', 'Col3']
print("After renaming columns:")
print(df12)
print()

# Change index
df12.index = ['Row1', 'Row2', 'Row3', 'Row4']
print("After changing index:")
print(df12)
print()

# Name the index and the column axis
df12.index.name = 'Row_ID'
df12.columns.name = 'Columns'
print("With index and column names:")
print(df12)
print()

# Reset index (creates new column with old index)
df_reset = df12.reset_index()
print("After reset_index():")
print(df_reset)

Original DataFrame:
   a  b   c
0  1  5   9
1  2  6  10
2  3  7  11
3  4  8  12

After renaming columns:
   Col1  Col2  Col3
0     1     5     9
1     2     6    10
2     3     7    11
3     4     8    12

After changing index:
      Col1  Col2  Col3
Row1     1     5     9
Row2     2     6    10
Row3     3     7    11
Row4     4     8    12

With index and column names:
Columns  Col1  Col2  Col3
Row_ID                   
Row1        1     5     9
Row2        2     6    10
Row3        3     7    11
Row4        4     8    12

After reset_index():
Columns Row_ID  Col1  Col2  Col3
0         Row1     1     5     9
1         Row2     2     6    10
2         Row3     3     7    11
3         Row4     4     8    12
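
Assigning to `.columns` and `.index` replaces every label at once and modifies the DataFrame in place. When only some labels need to change, `rename()` and `set_index()` are common alternatives that return a new DataFrame. A minimal sketch with hypothetical data:

```python
import pandas as pd

df_small = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# Rename a single column; the original DataFrame is left unchanged
renamed = df_small.rename(columns={'a': 'Col1'})

# Move a column into the index
indexed = renamed.set_index('Col1')
print(indexed)
```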


## 11. Common DataFrame Creation Patterns

### Pattern 1: From existing data structures
```python
# Dictionary → DataFrame (most common)
data_dict = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data_dict)
```

### Pattern 2: Empty DataFrame with columns

In [15]:
df = pd.DataFrame(columns=['A', 'B', 'C'])
# Add rows later, e.g. with pd.concat (DataFrame.append was removed in pandas 2.0)

### Pattern 3: Time series data


In [16]:
dates = pd.date_range('2023-01-01', periods=10)
df = pd.DataFrame(index=dates)
df['Values'] = np.random.randn(10)

### Pattern 4: From function results
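
One way this pattern could look, as a sketch: a function returns one dictionary per call, and the collected results are passed to the constructor as a list of dictionaries. The helper `summarize` is hypothetical and stands in for any function that produces a row of results.

```python
import pandas as pd

def summarize(word):
    """Hypothetical helper: return one row of results as a dict."""
    return {'word': word, 'length': len(word), 'upper': word.upper()}

# Call the function once per input and collect the rows
results = [summarize(word) for word in ['alpha', 'beta', 'gamma']]
df_results = pd.DataFrame(results)
print(df_results)
```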

In [18]:

# Demonstration of patterns

# Pattern 1: Dictionary to DataFrame (most common)
print("Pattern 1: Dictionary → DataFrame")
data = {
    'Student': ['S1', 'S2', 'S3'],
    'Math': [85, 90, 78],
    'Science': [88, 92, 85]
}
df_pattern1 = pd.DataFrame(data)
print(df_pattern1)
print()

# Pattern 2: Build incrementally
print("Pattern 2: Build incrementally")
# Collect rows in a list and build the DataFrame once at the end.
# Calling pd.concat on an empty DataFrame inside a loop copies the data on
# every iteration and can emit a FutureWarning in recent pandas versions.
rows = []
for i in range(3):
    rows.append({'Timestamp': pd.Timestamp.now() + timedelta(hours=i),
                 'Value': np.random.rand(),
                 'Status': 'Active' if i % 2 == 0 else 'Inactive'})
df_pattern2 = pd.DataFrame(rows, columns=['Timestamp', 'Value', 'Status'])
print(df_pattern2)
print()

# Pattern 3: Time series with calculations
print("Pattern 3: Time series with calculations")
dates = pd.date_range('2023-01-01', periods=5, freq='D')
df_pattern3 = pd.DataFrame({'Date': dates})
df_pattern3['Day'] = df_pattern3['Date'].dt.day_name()
df_pattern3['Sales'] = np.random.randint(100, 500, 5)
df_pattern3['Cumulative_Sales'] = df_pattern3['Sales'].cumsum()
print(df_pattern3)

Pattern 1: Dictionary → DataFrame
  Student  Math  Science
0      S1    85       88
1      S2    90       92
2      S3    78       85

Pattern 2: Build incrementally
                   Timestamp     Value    Status
0 2025-12-25 23:35:40.865855  0.742194    Active
1 2025-12-26 00:35:40.866389  0.072353  Inactive
2 2025-12-26 01:35:40.866652  0.286307    Active

Pattern 3: Time series with calculations
        Date        Day  Sales  Cumulative_Sales
0 2023-01-01     Sunday    248               248
1 2023-01-02     Monday    494               742
2 2023-01-03    Tuesday    436              1178
3 2023-01-04  Wednesday    458              1636
4 2023-01-05   Thursday    328              1964



# Summary: DataFrame Creation Methods

## Key Creation Methods:

### 1. **From Dictionaries**
- Dict of lists: Most common, keys → columns
- Dict of dicts: Outer keys → columns, inner keys → index
- Dict of Series: Each Series becomes a column
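
The dict-of-dicts form is not demonstrated earlier, so here is a minimal sketch with hypothetical scores. `pd.DataFrame.from_dict()` with `orient='index'` covers the opposite layout, where the outer keys are row labels.

```python
import pandas as pd

# Outer keys become columns, inner keys become the index
by_column = pd.DataFrame({
    'Math': {'S1': 85, 'S2': 90},
    'Science': {'S1': 88, 'S2': 92},
})
print(by_column)

# Outer keys as row labels instead
by_row = pd.DataFrame.from_dict({
    'S1': {'Math': 85, 'Science': 88},
    'S2': {'Math': 90, 'Science': 92},
}, orient='index')
print(by_row)
```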

### 2. **From Lists**
- List of lists: Each inner list = row
- List of dicts: Each dict = row
- List of Series: Each Series = row

### 3. **From NumPy Arrays**
- 2D array: Direct conversion
- Dict of arrays: Arrays become columns

### 4. **From External Sources**
- CSV: `pd.read_csv()`
- Excel: `pd.read_excel()`
- JSON: `pd.read_json()`

## Important Attributes and Methods:
- `.shape`: (rows, columns)
- `.columns`: Column names
- `.index`: Row labels
- `.dtypes`: Data types per column
- `.info()`: Concise summary
- `.describe()`: Statistical summary

## Best Practices:
1. **Dictionary of lists** is the most common and usually the simplest way to build a DataFrame from in-memory data
2. Always check `.shape` and `.dtypes` after creation
3. Use `.info()` for quick inspection
4. Set meaningful column names and index during creation
5. Consider memory usage when creating large DataFrames
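
For the last point, `memory_usage(deep=True)` reports how many bytes a column occupies. The sketch below uses hypothetical data with many repeated strings, where converting to the `category` dtype is one common way to reduce memory.

```python
import pandas as pd

# 10,000 values drawn from only four distinct strings
cities = pd.Series(['New York', 'London', 'Tokyo', 'Paris'] * 2500)

as_text = cities.memory_usage(deep=True)
as_category = cities.astype('category').memory_usage(deep=True)

print(f"text: {as_text} bytes, category: {as_category} bytes")
```

The saving depends on how often values repeat; a column of mostly unique strings gains little from this conversion.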

## Common Pitfalls:
1. Lists in dictionary must have same length
2. Missing values become NaN, which can change a column's dtype (for example, int64 becomes float64)
3. Mixed data types in a column become 'object' type
4. Index alignment matters in operations
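
The alignment pitfall often shows up when adding a column. In this sketch with hypothetical data, assigning a Series matches values by index label, while assigning a plain array matches by position.

```python
import pandas as pd

df_labels = pd.DataFrame({'A': [1, 2, 3]}, index=['x', 'y', 'z'])
values = pd.Series([10, 20, 30])  # default index 0, 1, 2

# Assigned by label: no label matches, so the whole column is NaN
df_labels['B'] = values

# Assigned by position: the index of the Series is ignored
df_labels['C'] = values.to_numpy()
print(df_labels)
```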

# Practice Exercises

## Exercise 1: Basic DataFrame Creation
1. Create a DataFrame with columns: 'Student_ID', 'Name', 'Grade', 'Attendance'
2. Add 5 rows of sample data
3. Display the DataFrame shape and data types

## Exercise 2: From Different Sources
1. Create the same DataFrame using:
   a) Dictionary of lists
   b) List of dictionaries
   c) List of lists with columns parameter
2. Compare the results

## Exercise 3: Time Series DataFrame
1. Create a DataFrame with date index for the last 7 days
2. Add columns: 'Temperature', 'Humidity', 'Rainfall' with random values
3. Add a column with day names
4. Calculate the average temperature

## Exercise 4: DataFrame Inspection
1. Create a DataFrame with mixed data types (int, float, string, boolean)
2. Use all inspection methods: .info(), .describe(), .shape, .columns, .dtypes
3. Rename the columns and set a custom index

## Useful Functions for DataFrame Creation:
- `pd.DataFrame()`: Primary constructor
- `pd.read_csv()`: Read CSV files
- `pd.read_excel()`: Read Excel files
- `pd.read_json()`: Read JSON files
- `pd.date_range()`: Create date sequences
- `pd.concat()`: Combine DataFrames
