# Series and DataFrames

## Learning Objectives

By the end of this notebook, you will be able to:

1. Create Pandas Series from various data sources
2. Create DataFrames from dictionaries, lists, and other structures
3. Access basic attributes of Series and DataFrames
4. View and inspect data using `head()`, `tail()`, `info()`, and `describe()`
5. Understand the relationship between Series and DataFrames

---

## 1. Introduction to Pandas

Pandas is a widely used Python library for data manipulation and analysis. It provides two main data structures:

- **Series**: A one-dimensional labeled array
- **DataFrame**: A two-dimensional labeled data structure (like a spreadsheet or SQL table)

Let's start by importing pandas:

In [None]:
import pandas as pd
import numpy as np

# Check the version
print(f"Pandas version: {pd.__version__}")

---

## 2. Pandas Series

A Series is a one-dimensional array with labels (called an index). Think of it as a single column of data.

### 2.1 Creating Series from a List

In [None]:
# Create a Series from a list
temperatures = pd.Series([72, 68, 75, 80, 77])
print("Temperature Series:")
print(temperatures)
print(f"\nType: {type(temperatures)}")

In [None]:
# Create a Series with a custom index
days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']
temperatures = pd.Series([72, 68, 75, 80, 77], index=days)
print("Temperature Series with custom index:")
print(temperatures)

### 2.2 Creating Series from a Dictionary

In [None]:
# Create a Series from a dictionary
population = pd.Series({
    'New York': 8336817,
    'Los Angeles': 3979576,
    'Chicago': 2693976,
    'Houston': 2320268,
    'Phoenix': 1680992
})
print("City Population Series:")
print(population)
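
A Series can also be built from a NumPy array or from a single scalar value. The following is a minimal sketch; the names `readings` and `defaults` and their values are illustrative and not part of the datasets used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Series from a NumPy array: the values keep the array's dtype
readings = pd.Series(np.array([0.5, 1.5, 2.5]), index=['a', 'b', 'c'])
print(readings)
print(f"\nData type: {readings.dtype}")

# Series from a scalar: the value is repeated once per index label
defaults = pd.Series(0, index=['a', 'b', 'c'])
print(f"\n{defaults}")
```

When a scalar is passed, an explicit `index` is what determines the length of the resulting Series.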

### 2.3 Series Attributes

In [None]:
# Key attributes of a Series
print(f"Values: {population.values}")
print(f"\nIndex: {population.index}")
print(f"\nData type: {population.dtype}")
print(f"\nShape: {population.shape}")
print(f"\nSize: {population.size}")
print(f"\nName: {population.name}")

In [None]:
# Name the Series and its index
population.name = 'Population'
population.index.name = 'City'
print(population)

### 2.4 Accessing Series Elements

In [None]:
# Access by label
print(f"New York population: {population['New York']:,}")

# Access by position
print(f"First city population: {population.iloc[0]:,}")

# Access multiple elements by position (the first three)
print(f"\nTop 3 cities:\n{population.iloc[:3]}")
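
Label-based and position-based access treat slices differently, which is easy to miss. The sketch below rebuilds the `population` Series so it runs on its own; `by_label`, `by_position`, and `selected` are illustrative names.

```python
import pandas as pd

population = pd.Series({
    'New York': 8336817,
    'Los Angeles': 3979576,
    'Chicago': 2693976,
    'Houston': 2320268,
    'Phoenix': 1680992
})

# .loc selects by label; a label slice includes BOTH endpoints
by_label = population.loc['New York':'Chicago']

# .iloc selects by position; the stop position is excluded, as with Python lists
by_position = population.iloc[0:2]

# A list of labels returns the elements in the order requested
selected = population.loc[['Phoenix', 'New York']]

print(by_label)
print(by_position)
print(selected)
```

Using `.loc` and `.iloc` explicitly avoids ambiguity about whether a key is meant as a label or as a position.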

---

## 3. Pandas DataFrames

A DataFrame is a two-dimensional data structure with labeled rows and columns. Think of it as a dictionary of Series sharing the same index.
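
To make the "dictionary of Series" idea concrete, here is a small sketch that passes a dictionary of Series to `pd.DataFrame`. The Series `ages` and `cities` are hypothetical and deliberately have partly different labels, so the alignment on a shared index is visible.

```python
import pandas as pd

ages = pd.Series({'Alice': 25, 'Bob': 30, 'Charlie': 35})
cities = pd.Series({'Bob': 'Los Angeles', 'Charlie': 'Chicago', 'Diana': 'Houston'})

# Each dictionary key becomes a column; rows are aligned on the index labels
people = pd.DataFrame({'Age': ages, 'City': cities})
print(people)

# Labels missing from one Series are filled with NaN in that column
print(people.isna())
```

Every column of the result shares one index, which is the union of the labels from the input Series.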

### 3.1 Creating DataFrames from a Dictionary

In [None]:
# Create a DataFrame from a dictionary of lists
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'Age': [25, 30, 35, 28, 32],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix'],
    'Salary': [70000, 80000, 90000, 75000, 85000]
}

df = pd.DataFrame(data)
print("Employee DataFrame:")
print(df)

In [None]:
# Create a DataFrame with a custom index
df = pd.DataFrame(data, index=['E001', 'E002', 'E003', 'E004', 'E005'])
df.index.name = 'Employee_ID'
print("Employee DataFrame with custom index:")
print(df)

### 3.2 Creating DataFrames from a List of Dictionaries

In [None]:
# Create a DataFrame from a list of dictionaries (row-oriented)
products = [
    {'product': 'Laptop', 'price': 999.99, 'quantity': 50},
    {'product': 'Mouse', 'price': 29.99, 'quantity': 200},
    {'product': 'Keyboard', 'price': 79.99, 'quantity': 150},
    {'product': 'Monitor', 'price': 299.99, 'quantity': 75}
]

products_df = pd.DataFrame(products)
print("Products DataFrame:")
print(products_df)
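
With row-oriented input, the dictionaries do not need to have identical keys. The sketch below uses a hypothetical `records` list (a trimmed variant of the products data) to show how missing keys appear in the result.

```python
import pandas as pd

records = [
    {'product': 'Laptop', 'price': 999.99, 'quantity': 50},
    {'product': 'Mouse', 'price': 29.99},       # no 'quantity' key
    {'product': 'Keyboard', 'quantity': 150},   # no 'price' key
]

records_df = pd.DataFrame(records)
print(records_df)

# Count the missing values in each column
print(records_df.isna().sum())
```

Note that a missing value turns the integer `quantity` column into a floating-point column, because NaN is a float.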

### 3.3 Creating DataFrames from NumPy Arrays

In [None]:
# Create a DataFrame from a NumPy array
np.random.seed(42)
random_data = np.random.randint(1, 100, size=(5, 4))

scores_df = pd.DataFrame(
    random_data,
    columns=['Math', 'Science', 'English', 'History'],
    index=['Student_1', 'Student_2', 'Student_3', 'Student_4', 'Student_5']
)
print("Student Scores DataFrame:")
print(scores_df)

---

## 4. DataFrame Attributes

In [None]:
# Let's use our employee DataFrame
print("DataFrame attributes:")
print(f"Shape: {df.shape}")
print(f"Size: {df.size}")
print(f"Dimensions: {df.ndim}")
print(f"\nColumns: {df.columns.tolist()}")
print(f"\nIndex: {df.index.tolist()}")
print(f"\nData types:\n{df.dtypes}")

In [None]:
# Access the underlying values as a NumPy array
print("Values (as NumPy array):")
print(df.values)
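
The pandas documentation recommends `to_numpy()` over `.values` for this conversion. The sketch below uses a reduced, stand-alone version of the employee data to show how the resulting array's dtype depends on the columns selected.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob'],
    'Age': [25, 30],
    'Salary': [70000, 80000]
})

# Mixed column types cannot share a numeric dtype, so the array holds Python objects
mixed = df.to_numpy()
print(mixed.dtype)

# Selecting only numeric columns yields a numeric array
numeric = df[['Age', 'Salary']].to_numpy()
print(numeric.dtype)
print(numeric.shape)
```

Selecting the numeric columns first is usually preferable when the array is passed on to numerical code.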

---

## 5. Viewing Data

### 5.1 head() and tail()

In [None]:
# Create a larger DataFrame for demonstration
np.random.seed(42)
large_df = pd.DataFrame({
    'A': np.random.randn(100),
    'B': np.random.randn(100),
    'C': np.random.randint(1, 100, 100),
    'D': np.random.choice(['X', 'Y', 'Z'], 100)
})

print("First 5 rows (head):")
print(large_df.head())

In [None]:
# View last 5 rows
print("Last 5 rows (tail):")
print(large_df.tail())

In [None]:
# View a specific number of rows
print("First 3 rows:")
print(large_df.head(3))

### 5.2 info() - DataFrame Summary

In [None]:
# Get a concise summary of the DataFrame
print("DataFrame Info:")
large_df.info()

### 5.3 describe() - Statistical Summary

In [None]:
# Get a statistical summary of the numeric columns
print("Statistical Summary:")
print(large_df.describe())

In [None]:
# Include all columns (including non-numeric)
print("Full Statistical Summary:")
print(large_df.describe(include='all'))
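
The result of `describe()` is itself a DataFrame, so individual statistics can be looked up by label. A minimal sketch, using a hypothetical `demo_df` built the same way as columns `A` and `C` above:

```python
import numpy as np
import pandas as pd

np.random.seed(42)
demo_df = pd.DataFrame({
    'A': np.random.randn(100),
    'C': np.random.randint(1, 100, 100),
})

summary = demo_df.describe()
print(type(summary))

# Rows are labeled by statistic, columns by the original column names
print(summary.loc['mean', 'C'])
print(summary.loc[['min', 'max']])
```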

### 5.4 sample() - Random Sampling

In [None]:
# Get random samples from the DataFrame
print("5 random rows:")
print(large_df.sample(5))
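
Because `sample()` draws rows at random, the output changes from run to run. Passing `random_state` makes the draw repeatable, and `frac` samples a fraction of the rows instead of a fixed count. The sketch below uses a hypothetical `demo_df`; the seed value 42 is arbitrary.

```python
import numpy as np
import pandas as pd

demo_df = pd.DataFrame({'value': np.arange(100)})

# The same random_state returns the same rows every time
first = demo_df.sample(5, random_state=42)
second = demo_df.sample(5, random_state=42)
print(first.equals(second))

# frac=0.1 samples 10% of the rows
fraction = demo_df.sample(frac=0.1, random_state=42)
print(len(fraction))
```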

---

## 6. Relationship Between Series and DataFrames

In [None]:
# A DataFrame column is a Series
name_column = df['Name']
print(f"Type of column: {type(name_column)}")
print(f"\nName column:\n{name_column}")

In [None]:
# A DataFrame row is also a Series
first_row = df.iloc[0]
print(f"Type of row: {type(first_row)}")
print(f"\nFirst row:\n{first_row}")
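
A row Series and a column Series are labeled differently. The sketch below, on a reduced stand-alone copy of the employee data, shows that a row is indexed by the column names and named after its row label, while a column keeps the DataFrame's row index.

```python
import pandas as pd

df = pd.DataFrame(
    {
        'Name': ['Alice', 'Bob'],
        'Age': [25, 30],
        'City': ['New York', 'Los Angeles']
    },
    index=['E001', 'E002']
)

first_row = df.iloc[0]
print(first_row.index.tolist())  # the column names
print(first_row.name)            # the row's index label
print(first_row.dtype)           # object, because the row mixes strings and integers

age_column = df['Age']
print(age_column.index.tolist()) # the row labels
print(age_column.name)           # the column name
```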

In [None]:
# Create a DataFrame from multiple Series
names = pd.Series(['Alice', 'Bob', 'Charlie'], name='Name')
ages = pd.Series([25, 30, 35], name='Age')
cities = pd.Series(['NYC', 'LA', 'Chicago'], name='City')

combined_df = pd.concat([names, ages, cities], axis=1)
print("DataFrame from Series:")
print(combined_df)

---

## Exercises

Now it's your turn to practice!

### Exercise 1: Create a Series

Create a Series called `stock_prices` containing the following stock prices:
- AAPL: 150.25
- GOOGL: 2750.80
- MSFT: 305.50
- AMZN: 3380.00
- TSLA: 725.60

Then print the Series and its data type.

In [None]:
# Your code here


<details>
<summary>Click to reveal solution</summary>

```python
stock_prices = pd.Series({
    'AAPL': 150.25,
    'GOOGL': 2750.80,
    'MSFT': 305.50,
    'AMZN': 3380.00,
    'TSLA': 725.60
})
stock_prices.name = 'Stock Price'
stock_prices.index.name = 'Ticker'

print(stock_prices)
print(f"\nData type: {stock_prices.dtype}")
```
</details>

### Exercise 2: Create a DataFrame

Create a DataFrame called `books_df` with the following information about books:

| Title | Author | Year | Pages | Price |
|-------|--------|------|-------|-------|
| Python Crash Course | Eric Matthes | 2019 | 544 | 35.99 |
| Fluent Python | Luciano Ramalho | 2022 | 1012 | 65.00 |
| Learning Python | Mark Lutz | 2013 | 1648 | 60.00 |
| Effective Python | Brett Slatkin | 2019 | 480 | 45.00 |

Print the DataFrame and its shape.

In [None]:
# Your code here


<details>
<summary>Click to reveal solution</summary>

```python
books_df = pd.DataFrame({
    'Title': ['Python Crash Course', 'Fluent Python', 'Learning Python', 'Effective Python'],
    'Author': ['Eric Matthes', 'Luciano Ramalho', 'Mark Lutz', 'Brett Slatkin'],
    'Year': [2019, 2022, 2013, 2019],
    'Pages': [544, 1012, 1648, 480],
    'Price': [35.99, 65.00, 60.00, 45.00]
})

print(books_df)
print(f"\nShape: {books_df.shape}")
```
</details>

### Exercise 3: Explore DataFrame Attributes

Using the `books_df` DataFrame from Exercise 2:
1. Print all column names
2. Print the data types of each column
3. Use `info()` to get a summary
4. Use `describe()` to get statistics for numeric columns

In [None]:
# Your code here


<details>
<summary>Click to reveal solution</summary>

```python
# Recreate books_df if needed
books_df = pd.DataFrame({
    'Title': ['Python Crash Course', 'Fluent Python', 'Learning Python', 'Effective Python'],
    'Author': ['Eric Matthes', 'Luciano Ramalho', 'Mark Lutz', 'Brett Slatkin'],
    'Year': [2019, 2022, 2013, 2019],
    'Pages': [544, 1012, 1648, 480],
    'Price': [35.99, 65.00, 60.00, 45.00]
})

# 1. Column names
print("Column names:")
print(books_df.columns.tolist())

# 2. Data types
print("\nData types:")
print(books_df.dtypes)

# 3. Info
print("\nDataFrame Info:")
books_df.info()

# 4. Describe
print("\nStatistical Summary:")
print(books_df.describe())
```
</details>

### Exercise 4: Create a DataFrame from a List of Dictionaries

Create a DataFrame called `movies_df` from the following list of dictionaries:

```python
movies = [
    {'title': 'Inception', 'year': 2010, 'rating': 8.8, 'director': 'Christopher Nolan'},
    {'title': 'The Matrix', 'year': 1999, 'rating': 8.7, 'director': 'Wachowskis'},
    {'title': 'Interstellar', 'year': 2014, 'rating': 8.6, 'director': 'Christopher Nolan'},
    {'title': 'Pulp Fiction', 'year': 1994, 'rating': 8.9, 'director': 'Quentin Tarantino'}
]
```

Print the first 2 rows and the last 2 rows.

In [None]:
# Your code here


<details>
<summary>Click to reveal solution</summary>

```python
movies = [
    {'title': 'Inception', 'year': 2010, 'rating': 8.8, 'director': 'Christopher Nolan'},
    {'title': 'The Matrix', 'year': 1999, 'rating': 8.7, 'director': 'Wachowskis'},
    {'title': 'Interstellar', 'year': 2014, 'rating': 8.6, 'director': 'Christopher Nolan'},
    {'title': 'Pulp Fiction', 'year': 1994, 'rating': 8.9, 'director': 'Quentin Tarantino'}
]

movies_df = pd.DataFrame(movies)

print("First 2 rows:")
print(movies_df.head(2))

print("\nLast 2 rows:")
print(movies_df.tail(2))
```
</details>

### Exercise 5: Extract Series from DataFrame

Using the `movies_df` DataFrame:
1. Extract the 'rating' column as a Series
2. Print the type to confirm it's a Series
3. Calculate the mean, min, and max rating

In [None]:
# Your code here


<details>
<summary>Click to reveal solution</summary>

```python
# Recreate movies_df if needed
movies = [
    {'title': 'Inception', 'year': 2010, 'rating': 8.8, 'director': 'Christopher Nolan'},
    {'title': 'The Matrix', 'year': 1999, 'rating': 8.7, 'director': 'Wachowskis'},
    {'title': 'Interstellar', 'year': 2014, 'rating': 8.6, 'director': 'Christopher Nolan'},
    {'title': 'Pulp Fiction', 'year': 1994, 'rating': 8.9, 'director': 'Quentin Tarantino'}
]
movies_df = pd.DataFrame(movies)

# 1. Extract rating column
ratings = movies_df['rating']

# 2. Print type
print(f"Type: {type(ratings)}")
print(f"\nRatings Series:\n{ratings}")

# 3. Calculate statistics
print(f"\nMean rating: {ratings.mean():.2f}")
print(f"Min rating: {ratings.min()}")
print(f"Max rating: {ratings.max()}")
```
</details>

---

## Summary

In this notebook, you learned:

1. **Series**: One-dimensional labeled arrays
   - Created from lists, dictionaries, or NumPy arrays
   - Have index, values, dtype, shape, and name attributes

2. **DataFrames**: Two-dimensional labeled data structures
   - Created from dictionaries, lists of dictionaries, or NumPy arrays
   - Each column is a Series
   - Each row can be accessed as a Series

3. **Viewing Data**:
   - `head()` and `tail()` for first/last rows
   - `info()` for DataFrame summary
   - `describe()` for statistical summary
   - `sample()` for random rows

4. **Key Attributes**:
   - `shape`, `size`, `ndim` for dimensions
   - `columns`, `index` for labels
   - `dtypes` for data types
   - `values` for underlying NumPy array

---

## Next Steps

Continue to the next notebook: **[02_reading_writing_data.ipynb](02_reading_writing_data.ipynb)** to learn how to read and write data from various file formats.