# Pandas Fundamentals I - Part 1: Introduction and DataFrame Creation

## Week 2, Day 2 (Thursday) - April 17th, 2025

### Overview
This is the first part of our introduction to Pandas, focusing on core concepts and DataFrame creation. We'll explore how Pandas relates to SQL and learn how to create and load DataFrames from various sources.

### Learning Objectives
- Understand what Pandas is and why it's essential for data analysis
- Compare Pandas DataFrames to SQL tables
- Create DataFrames from different data structures
- Load data from external files

### Prerequisites
- Python fundamentals (Week 1)
- NumPy basics (Week 2, Day 1)
- SQL knowledge (prior to course)

## 1. Introduction to Pandas

### What is Pandas?

Pandas is a powerful, open-source Python library for data manipulation and analysis. The name comes from "panel data," a term for multidimensional structured data sets in econometrics.

Key features of Pandas include:
- Fast, efficient manipulation of large data sets
- Tools for reading and writing data between in-memory data structures and different file formats
- Smart data alignment and handling of missing data
- Reshaping and pivoting of data sets
- Label-based slicing, fancy indexing, and subsetting of large data sets
- Data aggregation and transformation
- High-performance merging and joining of data sets

In [None]:
# Import pandas
import pandas as pd
import numpy as np

# Check pandas version
print(f"Pandas version: {pd.__version__}")

### Why Pandas for Data Analysis?

If you're coming from a SQL background, you might wonder why we need Pandas when we already have powerful database systems. Here are some advantages of using Pandas for data analysis:

1. **Flexibility**: Pandas works with various data formats, not just databases.
2. **Interactivity**: Immediate feedback during exploratory data analysis.
3. **Integration**: Works seamlessly with other Python libraries for visualization and modeling.
4. **Functionality**: Combines SQL-style querying, spreadsheet-style tabular data, and general-purpose programming in one tool.
5. **Performance**: Core operations build on NumPy and are implemented in optimized C and Cython code.

## 2. Pandas vs. SQL: Key Similarities and Differences

Since you already know SQL, let's compare some SQL operations with their Pandas equivalents:

| SQL Operation | Pandas Equivalent |
|--------------|-------------------|
| `SELECT * FROM table` | `df` |
| `SELECT col1, col2 FROM table` | `df[['col1', 'col2']]` |
| `SELECT * FROM table WHERE col > value` | `df[df['col'] > value]` |
| `SELECT col, COUNT(*) FROM table GROUP BY col` | `df.groupby('col').size()` |
| `SELECT * FROM table ORDER BY col` | `df.sort_values('col')` |
| `SELECT * FROM table LIMIT 5` | `df.head(5)` |
| `SELECT t1.col, t2.col FROM t1 JOIN t2 ON t1.id = t2.id` | `pd.merge(df1, df2, on='id')` |

We'll explore these operations in more detail throughout the course, but this gives you a preview of how your SQL knowledge will transfer to Pandas.
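
To make the table concrete, here is a minimal sketch that runs each Pandas equivalent on two tiny DataFrames. The `employees` and `offices` tables and all of their values are hypothetical, invented only for this illustration.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only
employees = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'name': ['Ana', 'Ben', 'Cara', 'Dev'],
    'dept': ['Sales', 'Sales', 'IT', 'IT'],
    'salary': [50000, 62000, 71000, 58000],
})
offices = pd.DataFrame({
    'id': [1, 2, 3],
    'city': ['Austin', 'Boston', 'Chicago'],
})

# SELECT name, salary FROM employees
selected = employees[['name', 'salary']]

# SELECT * FROM employees WHERE salary > 55000
high_paid = employees[employees['salary'] > 55000]

# SELECT dept, COUNT(*) FROM employees GROUP BY dept
dept_counts = employees.groupby('dept').size()

# SELECT * FROM employees ORDER BY salary
by_salary = employees.sort_values('salary')

# SELECT * FROM employees LIMIT 2
first_two = employees.head(2)

# SELECT * FROM employees JOIN offices ON employees.id = offices.id
joined = pd.merge(employees, offices, on='id')  # inner join by default

print(dept_counts)
print(joined)
```

Note that `pd.merge` performs an inner join unless told otherwise, so the employee with no matching office row is dropped from `joined`.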

## 3. Series Objects

Before diving into DataFrames, let's understand the Series object, which is a one-dimensional labeled array in Pandas.

In [None]:
# Creating a simple Series
s = pd.Series([1, 3, 5, 7, 9])
print(s)

In [None]:
# Series with custom index
s = pd.Series([1, 3, 5, 7, 9], index=['a', 'b', 'c', 'd', 'e'])
print(s)

In [None]:
# Creating a Series from a dictionary
population = {
    'New York': 8.4,
    'Los Angeles': 3.9,
    'Chicago': 2.7,
    'Houston': 2.3,
    'Phoenix': 1.6
}

city_pop = pd.Series(population)
print(city_pop)

In [None]:
# Series operations
print("Values above 3 million:")
print(city_pop[city_pop > 3])

print("\nMultiply all populations by 1 million:")
print(city_pop * 1000000)

Key characteristics of Series:
- Similar to a 1D NumPy array or a Python dictionary
- Has an index (like row labels in a database table)
- Supports vectorized operations
- Can handle missing data
- Can be thought of as a single column of a SQL table or of a Pandas DataFrame
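
Two of the characteristics above, the index and missing-data handling, are easiest to see together. The sketch below subtracts two Series whose labels only partly overlap; the figures are hypothetical and the names `pop_2010` and `pop_2020` are made up for this example.

```python
import pandas as pd

# Hypothetical figures in millions, invented for illustration
pop_2010 = pd.Series({'New York': 8.2, 'Chicago': 2.7, 'Houston': 2.1})
pop_2020 = pd.Series({'New York': 8.8, 'Houston': 2.3, 'Phoenix': 1.6})

# Arithmetic aligns on index labels, not on position.
# A label present in only one Series yields NaN (missing) in the result.
change = pop_2020 - pop_2010
print(change)

# Detect the missing entries, then drop them
print(change.isna())
print(change.dropna())
```

Only the labels found in both Series (`Houston` and `New York`) get a numeric result; `Chicago` and `Phoenix` come back as `NaN`.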

## 4. DataFrame Objects

A DataFrame is a 2D labeled data structure with columns of potentially different types. You can think of it as a spreadsheet or SQL table. It's the most commonly used Pandas object.

In [None]:
# Creating a simple DataFrame from a dictionary of Series that share one index
cities = ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix']
data = {
    'population': pd.Series([8.4, 3.9, 2.7, 2.3, 1.6], index=cities),
    'area': pd.Series([468.9, 502.7, 227.3, 637.5, 517.9], index=cities),
    'state': pd.Series(['NY', 'CA', 'IL', 'TX', 'AZ'], index=cities)
}

cities_df = pd.DataFrame(data)
print(cities_df)

In [None]:
# Calculate population density.
# Population is in millions and area is in square miles, so multiplying by 1000
# expresses density in thousands of people per square mile.
cities_df['density'] = cities_df['population'] / cities_df['area'] * 1000
print(cities_df)

Key characteristics of DataFrames:
- Similar to a SQL table or an Excel spreadsheet
- Has row labels (index) and column labels
- Can contain columns of different data types
- Supports SQL-like operations such as filtering, grouping, and joining
- Provides powerful data manipulation capabilities
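
As a quick illustration of the row labels, column labels, and per-column types listed above, here is a minimal sketch on a two-row DataFrame. The `coastal` column is a hypothetical addition, included only to show a boolean dtype.

```python
import pandas as pd

df = pd.DataFrame(
    {
        'population': [8.4, 3.9],
        'state': ['NY', 'CA'],
        'coastal': [True, True],  # hypothetical column for illustration
    },
    index=['New York', 'Los Angeles'],
)

print(df.index)    # row labels
print(df.columns)  # column labels
print(df.dtypes)   # one dtype per column

# Label-based lookup: row 'New York', column 'population'
print(df.loc['New York', 'population'])
```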

## 5. Creating DataFrames from Python Objects

Let's explore different ways to create DataFrames from Python objects.

In [None]:
# From a dictionary of lists or arrays
data = {
    'product_id': ['P001', 'P002', 'P003', 'P004', 'P005'],
    'product_name': ['Laptop', 'Smartphone', 'Tablet', 'Headphones', 'Monitor'],
    'category': ['Electronics', 'Electronics', 'Electronics', 'Accessories', 'Electronics'],
    'price': [1200, 800, 450, 150, 300],
    'in_stock': [True, True, False, True, True]
}

products_df = pd.DataFrame(data)
print(products_df)

In [None]:
# From a list of dictionaries (each dict = one row)
data = [
    {'order_id': 'O001', 'customer_id': 'C001', 'amount': 1250, 'date': '2025-01-05'},
    {'order_id': 'O002', 'customer_id': 'C002', 'amount': 800, 'date': '2025-01-06'},
    {'order_id': 'O003', 'customer_id': 'C001', 'amount': 150, 'date': '2025-01-09'},
    {'order_id': 'O004', 'customer_id': 'C003', 'amount': 450, 'date': '2025-01-10'},
    {'order_id': 'O005', 'customer_id': 'C002', 'amount': 300, 'date': '2025-01-15'}
]

orders_df = pd.DataFrame(data)
print(orders_df)
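
With a list of dictionaries, the `date` values arrive as plain strings, and a key that is absent from one row becomes a missing value. The sketch below uses a shortened, partly hypothetical version of the rows above (the second row deliberately omits `date`) and converts the column with `pd.to_datetime`.

```python
import pandas as pd

rows = [
    {'order_id': 'O001', 'amount': 1250, 'date': '2025-01-05'},
    {'order_id': 'O002', 'amount': 800},  # 'date' omitted on purpose
    {'order_id': 'O003', 'amount': 150, 'date': '2025-01-09'},
]
orders = pd.DataFrame(rows)

# The key missing from the second dict shows up as a missing value
print(orders['date'].isna())

# Convert the ISO-formatted strings to datetimes; missing entries become NaT
orders['date'] = pd.to_datetime(orders['date'])
print(orders.dtypes)
```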

In [None]:
# From a NumPy array
data = np.random.randint(0, 100, size=(5, 3))
columns = ['A', 'B', 'C']
index = ['Row1', 'Row2', 'Row3', 'Row4', 'Row5']

df_from_array = pd.DataFrame(data, index=index, columns=columns)
print(df_from_array)
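
Because the array above is random, its values change on every run. If reproducible output is wanted, one option is a seeded NumPy generator; this is a sketch of that alternative rather than part of the original example, and the seed value 42 is arbitrary.

```python
import numpy as np
import pandas as pd

# A seeded generator produces the same "random" numbers on every run
rng = np.random.default_rng(seed=42)
data = rng.integers(0, 100, size=(5, 3))  # upper bound 100 is exclusive

df_seeded = pd.DataFrame(data, columns=['A', 'B', 'C'])
print(df_seeded)
```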

## 6. Reading Data from External Files

In real-world data analysis, you'll often read data from files rather than creating DataFrames manually. Pandas provides functions to read data from various file formats.

In [None]:
# Reading from CSV files
# Note: Adjust the file path if needed
df_csv = pd.read_csv('../Data/numeric_data.csv')
print(df_csv.head())

In [None]:
# Reading CSV with options
df_csv_options = pd.read_csv(
    '../Data/numeric_data.csv',
    index_col='id',        # Use 'id' column as index
    parse_dates=['date'],  # Parse date column as datetime
    nrows=3,               # Read only first 3 rows
)
print(df_csv_options)
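
If the data file is not at hand, the same options can be tried on an in-memory CSV. The sketch below is self-contained: the CSV text is hypothetical, with `id` and `date` columns chosen to match the options above and the `label` and `reading` columns invented for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV content, held in a string instead of a file
csv_text = """id,date,label,reading
1,2025-01-05,x,10.5
2,2025-01-06,y,20.0
3,2025-01-09,x,30.5
4,2025-01-10,y,40.0
"""

df_demo = pd.read_csv(
    io.StringIO(csv_text),  # read_csv accepts file-like objects as well as paths
    index_col='id',         # use 'id' as the row index
    parse_dates=['date'],   # parse 'date' into datetimes
    nrows=3,                # stop after the first 3 data rows
)
print(df_demo)
print(df_demo.dtypes)
```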

### Other File Formats

Pandas can read data from many other file formats, including:

- Excel files: `pd.read_excel('file.xlsx')`
- JSON: `pd.read_json('file.json')`
- SQL databases: 
  ```python
  from sqlalchemy import create_engine
  engine = create_engine('sqlite:///database.db')
  df = pd.read_sql('SELECT * FROM my_table', engine)  # my_table is a placeholder name
  ```
- HTML tables: `pd.read_html('http://example.com/table.html')` (returns a list of DataFrames, one per table found)
- Stata, SAS, and SPSS files: `pd.read_stata()`, `pd.read_sas()`, `pd.read_spss()`
- Parquet and other columnar formats: `pd.read_parquet()`

### Reading data from SQL databases

Since you already know SQL, this might be particularly interesting. Here's an example of how you might read data from a SQL database:

In [None]:
# Note: This code is for demonstration only; we're not connecting to a real database in this notebook
'''
from sqlalchemy import create_engine

# Create a connection to the database
# Replace with your actual database connection string
engine = create_engine('sqlite:///olist.db')  

# Read data using a SQL query
query = """
SELECT o.order_id, c.customer_id, o.order_status, o.order_purchase_timestamp
FROM olist_orders_dataset o
JOIN olist_customers_dataset c ON o.customer_id = c.customer_id
LIMIT 5
"""

df_sql = pd.read_sql(query, engine)
print(df_sql)
'''
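
For a version that runs without any external database, the sketch below builds a throwaway in-memory SQLite database with Python's built-in `sqlite3` module, which `pd.read_sql` also accepts in place of a SQLAlchemy engine. The `orders` table and its contents are hypothetical.

```python
import sqlite3
import pandas as pd

# In-memory SQLite database; nothing is written to disk
conn = sqlite3.connect(':memory:')

# Hypothetical table, loaded from a DataFrame for convenience
orders = pd.DataFrame({
    'order_id': ['O001', 'O002', 'O003'],
    'customer_id': ['C001', 'C002', 'C001'],
    'amount': [1250, 800, 150],
})
orders.to_sql('orders', conn, index=False)

# Let the database do the aggregation, then load the result into Pandas
query = """
SELECT customer_id, COUNT(*) AS n_orders, SUM(amount) AS total
FROM orders
GROUP BY customer_id
ORDER BY customer_id
"""
df_sql_demo = pd.read_sql(query, conn)
print(df_sql_demo)

conn.close()
```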

## 7. Practice Exercises

Now let's practice creating and manipulating DataFrames with some exercises.

### Exercise 1: Create a DataFrame from a dictionary

Create a DataFrame that represents customer data for an e-commerce website with the following columns:
- customer_id
- name
- email
- signup_date
- is_premium

Include at least 5 customers in your DataFrame.

In [None]:
# Your code here


### Exercise 2: Read and explore sample data

Read the sample CSV file we used earlier and answer the following questions:
1. How many rows and columns are in the dataset?
2. What are the column names?
3. What is the average value of 'value1'?
4. How many records belong to each category?

In [None]:
# Your code here


### Exercise 3: SQL to Pandas translation

Translate the following SQL query into Pandas code using the `products_df` DataFrame we created earlier:

```sql
SELECT product_name, price
FROM products
WHERE category = 'Electronics' AND price > 500
ORDER BY price DESC
```

In [None]:
# Your code here


## Next Steps

In the next part, we'll dive deeper into basic DataFrame operations, including data inspection, column and row operations, and handling missing data.

Continue to [Part 2: Basic DataFrame Operations](02_Pandas_Fundamentals_I_part2.ipynb)