# Pandas Meets scikit-learn: Exercises



**Objective:**  
This notebook covers the basics of Pandas for data exploration and manipulation. In this exercise, I will:
- Install and import Pandas.
- Explore the core data structures: **Series** and **DataFrame**.
- Create DataFrames from various sources: dictionaries, lists, CSV files, and even scikit‑learn datasets.
- Inspect data using methods like `head()`, `tail()`, `info()`, and `describe()`.

**Approach (Copilot‑Inspired):**
1. Ask for clear, step‑by‑step instructions aimed at a beginner.
2. Note any ambiguous terminology (e.g., what exactly is a DataFrame, what is an index, etc.).
3. Record follow‑on prompts for further explanations (e.g., "What is the difference between a Series and a DataFrame?").

**Initial Prompts:**
- "Show me how to import pandas and create a DataFrame from a dictionary, with a step‑by‑step explanation for a beginner."
- "Explain what a Pandas Series is and how it differs from a DataFrame."


## Pandas Concepts Covered

I will explore the following concepts:

1. **Installation and Setup:**  
   - Importing the Pandas library.
2. **Understanding Data Structures:**  
   - Overview of Series and DataFrame.
3. **Creating DataFrames:**  
   - From dictionaries and lists.
   - (Optionally) From CSV files and scikit‑learn datasets.
4. **Inspecting Data:**  
   - Using functions such as `head()`, `tail()`, `info()`, and `describe()` to gain insights into the dataset.
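
The scikit-learn route listed above is not exercised in the cells that follow, so here is a minimal sketch of it. It assumes scikit-learn is installed and uses the bundled iris dataset purely as an illustration; the variable names `iris` and `df_iris` are my own.

```python
import pandas as pd
from sklearn.datasets import load_iris

# as_frame=True asks scikit-learn to return pandas objects
# instead of NumPy arrays (assumes scikit-learn 0.23 or newer).
iris = load_iris(as_frame=True)

# .frame holds the four feature columns plus a "target" column.
df_iris = iris.frame

print(type(df_iris))
print(df_iris.shape)
print(df_iris.head(3))
```

From here the usual inspection methods (`head()`, `info()`, `describe()`) apply to `df_iris` exactly as they do to a hand-built DataFrame.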


In [1]:
import pandas as pd

# Check the version of Pandas
print("Pandas version:", pd.__version__)


Pandas version: 2.2.3


In [2]:
# Create a simple Pandas Series from a list
data_list = [10, 20, 30, 40, 50]
series_example = pd.Series(data_list)
print("Series:\n", series_example)


Series:
 0    10
1    20
2    30
3    40
4    50
dtype: int64
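
The left-hand column in the output above (0 to 4) is the default index. As a small sketch of what a customized index looks like, the example below reuses the same values with hypothetical string labels `'a'` to `'e'`; the name `labeled` is an assumption for illustration.

```python
import pandas as pd

# Hypothetical labels 'a'..'e' replace the default 0..4 index.
labeled = pd.Series([10, 20, 30, 40, 50], index=["a", "b", "c", "d", "e"])

by_label = labeled["c"]           # label-based lookup
by_position = labeled.iloc[2]     # position-based lookup, same element
label_slice = labeled.loc["b":"d"]  # label slices include both endpoints

print(by_label, by_position)
print(label_slice)
```

Note that slicing by label with `.loc` includes the end label, unlike ordinary Python list slicing.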


### Follow‑on Prompt:

"Explain the importance and functionality of the index in a Pandas Series."

The next cell creates a DataFrame from a dictionary.

In [3]:
# Create a DataFrame from a dictionary
data_dict = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Age': [25, 30, 35, 28],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']
}
df_example = pd.DataFrame(data_dict)
print("DataFrame:\n", df_example)


DataFrame:
       Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
3    Diana   28      Houston
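
One way to see how a Series relates to a DataFrame is to pull a single column out of one. The sketch below uses a small hypothetical DataFrame named `df` (not the one above) so that it runs on its own.

```python
import pandas as pd

# Hypothetical two-row DataFrame for illustration.
df = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]})

age_series = df["Age"]    # single brackets return a Series (1-D)
age_frame = df[["Age"]]   # double brackets return a DataFrame (2-D)

print(type(age_series).__name__, age_series.shape)
print(type(age_frame).__name__, age_frame.shape)

# The column shares the row index of the DataFrame it came from.
print(age_series.index.equals(df.index))
```

In this sense a DataFrame can be thought of as a collection of Series that share one index.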


In [4]:
# Using the head() and tail() functions to inspect the DataFrame
print("First 3 rows:\n", df_example.head(3))
print("\nLast 2 rows:\n", df_example.tail(2))


First 3 rows:
       Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago

Last 2 rows:
       Name  Age     City
2  Charlie   35  Chicago
3    Diana   28  Houston


In [5]:
# Using describe() and info() to get summary statistics and information about the DataFrame
print("Summary statistics:\n", df_example.describe())
print("\nDataFrame info:")
df_example.info()


Summary statistics:
              Age
count   4.000000
mean   29.500000
std     4.203173
min    25.000000
25%    27.250000
50%    29.000000
75%    31.250000
max    35.000000

DataFrame info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    4 non-null      object
 1   Age     4 non-null      int64 
 2   City    4 non-null      object
dtypes: int64(1), object(2)
memory usage: 228.0+ bytes
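
By default `describe()` only summarized the numeric `Age` column, while `info()` listed every column with its dtype and non-null count. As a hedged sketch, passing `include="all"` asks `describe()` to cover the text columns too; the DataFrame is rebuilt here under the assumed name `df` so the block is self-contained.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Diana"],
    "Age": [25, 30, 35, 28],
    "City": ["New York", "Los Angeles", "Chicago", "Houston"],
})

# include="all" adds rows such as "unique", "top", and "freq"
# for non-numeric columns; numeric-only rows show NaN for them.
summary = df.describe(include="all")
print(summary)

# dtypes gives the per-column types that info() also reports.
print(df.dtypes)
```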


## Learning Outcomes 

- **Installation and Setup:**
  - Imported pandas and verified its version.
- **Data Structures:**
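  - Learned that a **Series** is a one-dimensional labeled array; when no index is supplied, pandas assigns a default integer index starting at 0.
  - Learned that a **DataFrame** is a two-dimensional labeled table whose columns can each hold a different data type.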
- **Creating DataFrames:**
  - Created a DataFrame from a dictionary.
  - (Optionally) Explored creating a DataFrame by reading a CSV file.
- **Inspecting Data:**
  - Used `head()` and `tail()` to view subsets of data.
  - Used `describe()` and `info()` to retrieve summary statistics and dataset information.
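
The cells above only build a DataFrame from a dictionary. For the list and CSV routes mentioned in the objectives, a minimal sketch could look like the following. The sample rows are made up, and `io.StringIO` stands in for a CSV file on disk so the example does not depend on any particular file path.

```python
import io
import pandas as pd

# From a list of rows plus explicit column names (hypothetical data).
rows = [["Alice", 25], ["Bob", 30], ["Charlie", 35]]
df_from_lists = pd.DataFrame(rows, columns=["Name", "Age"])

# From CSV text; with a real file, pass its path to read_csv instead.
csv_text = "Name,Age\nDiana,28\nEve,41\n"
df_from_csv = pd.read_csv(io.StringIO(csv_text))

print(df_from_lists)
print(df_from_csv)
```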

**Ambiguous Terms/Further Questions:**
- What exactly is an **index** in a Series or DataFrame, and how can it be customized?
- What are the main differences between a Series and a DataFrame beyond dimensionality?
- How can missing data be detected and handled effectively in a DataFrame?
- What additional benefits do methods like `info()` provide compared to `describe()`?
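
As a starting point for the missing-data question, here is a small sketch using `isna()`, `dropna()`, and `fillna()`. The DataFrame and the choice to fill with the column mean are assumptions for illustration; the right strategy depends on the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing age.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, np.nan, 35],
})

# Detect: count missing values per column.
missing_per_column = df.isna().sum()

# Handle, option 1: drop any row that contains a missing value.
dropped = df.dropna()

# Handle, option 2: fill missing ages with the column mean
# (mean() skips NaN by default).
filled = df.fillna({"Age": df["Age"].mean()})

print(missing_per_column)
print(dropped)
print(filled)
```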


