# Pandas-1

In [None]:
# Pandas uses NumPy internally. NumPy, as we learned, is a library for scientific computing.

# Pandas is an open-source library used for data manipulation and analysis.



# Why Pandas?
# ====================================================================================================================
# While Python lists, dictionaries, and NumPy arrays are excellent foundational tools,
# Pandas builds upon them to provide higher-level, more structured, and more convenient data manipulation capabilities,
# especially for tabular (spreadsheet-like) data.


'''

Here's a breakdown of why Pandas is preferred:

1. Handling Tabular Data Naturally (DataFrames and Series)
Python Lists/Dicts: Great for general-purpose collections. Lists are ordered sequences, and dictionaries are key-value pairs. They don't inherently represent columns and rows with labels in a consistent way for analysis.

NumPy Arrays: Fantastic for numerical operations on homogeneous data. They are powerful matrices. However:

They are primarily for numerical data (homogeneous dtype). If you have mixed data types (numbers, strings, dates), NumPy arrays struggle or force dtype=object, losing performance benefits.
They lack named columns and rows (labels). You refer to data by integer indices (array[0, 1]). This makes code harder to read, maintain, and debug when dealing with complex datasets.
They don't directly handle missing data (NaN values) as gracefully for mixed types.
Pandas (DataFrame & Series):

DataFrame: The core Pandas object, designed explicitly for tabular data. It's essentially a collection of Series (columns) that share a common index (rows). It has: 

Labeled Axes: Both rows (index) and columns have names, making data selection and understanding intuitive (df['column_name'], df.loc['row_label']).
Heterogeneous Data Types: Each column can have its own dtype (e.g., one column of integers, another of strings, a third of booleans), while still being highly efficient.
Built-in Missing Data Handling: NaN (Not a Number) is natively supported and Pandas provides powerful methods to fillna(), dropna(), etc.
2. Convenience and Expressiveness
Python/NumPy: Often requires more verbose code or multiple steps for common data tasks. For example, filtering a NumPy array based on conditions across multiple columns, or joining data from different sources.
Pandas: Provides highly expressive, concise, and intuitive methods for common data operations:
Filtering/Subsetting: df[df['age'] > 30] is much more readable than equivalent NumPy boolean indexing on multiple arrays.
Grouping and Aggregation (.groupby()): Incredibly powerful for summarizing data by categories (e.g., "average sales per region"). This is complex and verbose with just NumPy.
Merging/Joining (.merge(), .join()): SQL-like operations to combine DataFrames based on common columns, essential for integrating data from various sources.
Reshaping (.pivot_table(), .stack(), .unstack()): Easily transform the layout of your data for different analytical needs.
Time Series Functionality: Robust features for handling dates and times, resampling, rolling calculations, etc., which are crucial for time-series analysis.
3. Missing Data Handling
NumPy: NaN (Not a Number) is the standard for missing numerical data. However, for non-numeric data types, NumPy arrays might fall back to dtype=object, which can be less efficient and harder to work with. Operations involving NaN in NumPy require explicit handling.
Pandas: Represents missing values across data types (NaN in float columns, None/NaN in object columns, NaT for datetimes, and pd.NA in the opt-in nullable dtypes such as Int64). Note that, by default, an integer column that gains a missing value is converted to float64. Pandas provides dedicated methods like .isnull(), .notnull(), .dropna(), .fillna(), and .interpolate() that make dealing with missing values much simpler and more robust.
4. Integration with Data Sources
Python/NumPy: You'd typically need to write custom code or use other libraries to read CSV, Excel, SQL databases, JSON, etc., and then convert them into arrays/lists.
Pandas: Built-in I/O tools make reading and writing various data formats trivial (pd.read_csv(), pd.read_excel(), df.to_sql(), df.to_json(), etc.). This dramatically speeds up the data loading and saving process.
5. Performance (Built on NumPy)
Crucially, Pandas is built on top of NumPy. This means that underneath its user-friendly interface, many Pandas operations leverage NumPy's optimized, vectorized C implementations. So, you get the best of both worlds: high-level convenience and low-level performance.

When to Use What:
#==============================#
Basic Python (lists, dicts, tuples): For general-purpose programming, small, unstructured collections, or when you need highly custom data structures.
NumPy Arrays: For pure numerical computation, especially when dealing with large, homogeneous, multi-dimensional arrays (like images, scientific simulations, numerical linear algebra). If you're doing complex math on number grids, NumPy is your direct tool.
Pandas DataFrames/Series: For most data analysis tasks, especially with tabular data that is mixed-type, has labels, or contains missing values. It's the go-to for data cleaning, exploration, transformation, and preparation before feeding it into machine learning models.
In summary, while you could theoretically do everything with just Python and NumPy, Pandas provides a specialized, highly optimized, and incredibly convenient framework that abstracts away much of the complexity, making data analysis much faster, easier, and more enjoyable for real-world datasets.


'''
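
The points above can be illustrated with a minimal sketch. The `sales` and `managers` tables, their column names, and their values are made up for illustration; only the Pandas calls themselves (`fillna`, boolean filtering, `groupby`, `merge`) come from the notes above.

```python
import numpy as np
import pandas as pd

# Hypothetical sales records: mixed dtypes and one missing price.
sales = pd.DataFrame({
    "region": ["North", "South", "North", "South"],
    "units": [10, 4, 7, 12],
    "price": [2.5, np.nan, 2.5, 3.0],
})

# Each column keeps its own dtype (text, integer, float).
print(sales.dtypes)

# Missing data: replace the missing price with the mean of the known prices.
sales["price"] = sales["price"].fillna(sales["price"].mean())

# Filtering: keep only the rows with more than 5 units.
big_orders = sales[sales["units"] > 5]

# Grouping and aggregation: total units per region.
units_per_region = sales.groupby("region")["units"].sum()

# Merging: attach a (hypothetical) manager to each row via the shared "region" column.
managers = pd.DataFrame({"region": ["North", "South"], "manager": ["Ana", "Raj"]})
merged = sales.merge(managers, on="region")

print(big_orders)
print(units_per_region)
print(merged)
```

Doing the same with plain NumPy would mean keeping a separate array per column and writing the grouping and joining logic by hand.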

In [7]:
# install pandas and import
!pip install pandas
import pandas as pd

Collecting pandas
  Downloading pandas-2.3.0-cp312-cp312-win_amd64.whl.metadata (19 kB)
Collecting pytz>=2020.1 (from pandas)
  Downloading pytz-2025.2-py2.py3-none-any.whl.metadata (22 kB)
Collecting tzdata>=2022.7 (from pandas)
  Downloading tzdata-2025.2-py2.py3-none-any.whl.metadata (1.4 kB)
Downloading pandas-2.3.0-cp312-cp312-win_amd64.whl (11.0 MB)
Downloading pytz-2025.2-py2.py3-none-any.whl (509 kB)
Downloading tzdata-2025.2-py2.py3-none-any.whl (347 kB)
Installing collected packages: pytz, tzdata, pandas
Successfully installed pandas-2.3.0 pytz-2025.2 tzdata-2025.2


In [8]:
print(f"Pandas version: {pd.__version__}")

Pandas version: 2.3.0


In [9]:
import pandas as pd

# Define the scores for s1
scores_s1 = [85, 92, 78, 65, 90]

# Create the Series s1
s1 = pd.Series(scores_s1)

print("--- Series s1 ---")
print(s1)
print(f"\nType of s1: {type(s1)}")
print(f"Index of s1: {s1.index}")
print(f"Values of s1: {s1.values}")

--- Series s1 ---
0    85
1    92
2    78
3    65
4    90
dtype: int64

Type of s1: <class 'pandas.core.series.Series'>
Index of s1: RangeIndex(start=0, stop=5, step=1)
Values of s1: [85 92 78 65 90]


In [12]:
import pandas as pd

# Define the scores for s2
scores_s2 = [75, 88, 95, 80]

# Define the custom index labels (names)
names_index = ['Alice', 'Bob', 'Charlie', 'David']

# Create the Series s2 with a custom index
s2 = pd.Series(scores_s2, index=names_index, name ="scores")

print("\n--- Series s2 ---")
print(s2)
print(f"\nType of s2: {type(s2)}")
print(f"Index of s2: {s2.index}")
print(f"Values of s2: {s2.values}")


--- Series s2 ---
Alice      75
Bob        88
Charlie    95
David      80
Name: scores, dtype: int64

Type of s2: <class 'pandas.core.series.Series'>
Index of s2: Index(['Alice', 'Bob', 'Charlie', 'David'], dtype='object')
Values of s2: [75 88 95 80]


In [13]:
# Creating series from dictionary

'''
Creating a Pandas Series from a Python dictionary is a very common and convenient way to initialize a Series,
especially when you want specific labels (keys from the dictionary) to be directly associated with values.

When you create a Series from a dictionary:

The keys of the dictionary become the index labels of the Series.
The values of the dictionary become the data values in the Series.
Here's an example:
'''

import pandas as pd

# Define a Python dictionary
# Keys will be the index labels, values will be the data
data_dict = {
    'Math': 95,
    'Science': 88,
    'History': 72,
    'Art': 91,
    'English': 85
}

# Create a Pandas Series from the dictionary
s_from_dict = pd.Series(data_dict)

print("--- Series created from a Dictionary ---")
print(s_from_dict)
print(f"\nType of s_from_dict: {type(s_from_dict)}")
print(f"Index of s_from_dict: {s_from_dict.index}")
print(f"Values of s_from_dict: {s_from_dict.values}")
print(f"Data type (dtype) of s_from_dict: {s_from_dict.dtype}")

--- Series created from a Dictionary ---
Math       95
Science    88
History    72
Art        91
English    85
dtype: int64

Type of s_from_dict: <class 'pandas.core.series.Series'>
Index of s_from_dict: Index(['Math', 'Science', 'History', 'Art', 'English'], dtype='object')
Values of s_from_dict: [95 88 72 91 85]
Data type (dtype) of s_from_dict: int64
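
A related detail worth knowing: if you pass an explicit `index` together with a dictionary, Pandas looks up each requested label among the dictionary keys instead of using the keys as they come. The sketch below uses a made-up `marks` dictionary and a deliberately missing label (`'Music'`) to show the effect.

```python
import pandas as pd

# Hypothetical marks; the keys would normally become the index labels.
marks = {"Math": 95, "Science": 88, "History": 72}

# Ask for a specific order, skip "Science", and request a label that is not a key.
ordered = pd.Series(marks, index=["History", "Math", "Music"])

print(ordered)
# "Music" has no matching key, so its value is NaN,
# and the dtype becomes float64 to accommodate the NaN.
print(ordered.dtype)
```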


# Series Indexing and Slicing

In [14]:
'''
This cell builds a Series from a dictionary with int64 values and explores how to index and slice it.

When you create a Series from a dictionary, the dictionary's keys become the index,
and you can use these labels for indexing. Integer-based positional access is also available through .iloc[]
(when you have explicit labels, it's generally best to stick to label-based access to avoid ambiguity).

'''

import pandas as pd

# Create a Series from a dictionary with int64 values
# Pandas will automatically infer the dtype as int64 because all values are integers.
exam_scores = pd.Series({
    'Alice': 85,
    'Bob': 92,
    'Charlie': 78,
    'David': 65,
    'Eve': 90,
    'Frank': 72,
    'Grace': 88
})

print("--- Original Series (exam_scores) ---")
print(exam_scores)
print(f"Data type (dtype): {exam_scores.dtype}") # Should be int64
print("-" * 30)

# --- 1. Indexing by Label ---
print("\n--- Indexing by Label ---")

# Accessing a single element by its label
print(f"Score for Alice: {exam_scores['Alice']}")

# Accessing multiple elements by a list of labels (Fancy Indexing)
selected_students = exam_scores[['Bob', 'Eve', 'Grace']]
print("\nScores for Bob, Eve, Grace:\n", selected_students)

# Attempting to access a non-existent label will raise a KeyError
try:
    print(exam_scores['Zoe'])
except KeyError as e:
    print(f"\nError accessing non-existent label: {e}")

# --- 2. Slicing by Label ---
# Slicing with labels is INCLUSIVE of the end label!
print("\n--- Slicing by Label ---")

# Scores from Charlie to Eve (inclusive)
scores_slice_label = exam_scores['Charlie':'Eve']
print("\nScores from Charlie to Eve (inclusive):\n", scores_slice_label)

# Slicing from a specific label to the end
scores_from_david = exam_scores['David':]
print("\nScores from David to end:\n", scores_from_david)

# Slicing from the beginning to a specific label
scores_to_charlie = exam_scores[:'Charlie']
print("\nScores from beginning to Charlie:\n", scores_to_charlie)


# --- 3. Indexing by Position (Integer Location) ---
# Use .iloc[] for purely integer-location based indexing
print("\n--- Indexing by Position (.iloc[]) ---")

# Accessing a single element by its integer position (0-indexed)
print(f"Score at position 0: {exam_scores.iloc[0]}") # Alice's score

# Accessing multiple elements by a list of integer positions
selected_by_pos = exam_scores.iloc[[1, 4, 6]] # Bob, Eve, Grace
print("\nScores at positions 1, 4, 6:\n", selected_by_pos)

# Attempting to access an out-of-bounds position will raise an IndexError
try:
    print(exam_scores.iloc[10])
except IndexError as e:
    print(f"\nError accessing out-of-bounds position: {e}")

# --- 4. Slicing by Position (Integer Location) ---
# Use .iloc[] for purely integer-location based slicing
# Slicing with .iloc[] is EXCLUSIVE of the end position, just like Python lists
print("\n--- Slicing by Position (.iloc[]) ---")

# Scores from position 1 to 4 (exclusive of 4) -> index 1, 2, 3 (Bob, Charlie, David)
scores_slice_pos = exam_scores.iloc[1:4]
print("\nScores from position 1 to 4 (exclusive of 4):\n", scores_slice_pos)

# Scores from position 2 to end
scores_from_pos_2 = exam_scores.iloc[2:]
print("\nScores from position 2 to end:\n", scores_from_pos_2)

# Scores from beginning to position 3 (exclusive of 3)
scores_to_pos_3 = exam_scores.iloc[:3]
print("\nScores from beginning to position 3 (exclusive of 3):\n", scores_to_pos_3)


--- Original Series (exam_scores) ---
Alice      85
Bob        92
Charlie    78
David      65
Eve        90
Frank      72
Grace      88
dtype: int64
Data type (dtype): int64
------------------------------

--- Indexing by Label ---
Score for Alice: 85

Scores for Bob, Eve, Grace:
 Bob      92
Eve      90
Grace    88
dtype: int64

Error accessing non-existent label: 'Zoe'

--- Slicing by Label ---

Scores from Charlie to Eve (inclusive):
 Charlie    78
David      65
Eve        90
dtype: int64

Scores from David to end:
 David    65
Eve      90
Frank    72
Grace    88
dtype: int64

Scores from beginning to Charlie:
 Alice      85
Bob        92
Charlie    78
dtype: int64

--- Indexing by Position (.iloc[]) ---
Score at position 0: 85

Scores at positions 1, 4, 6:
 Bob      92
Eve      90
Grace    88
dtype: int64

Error accessing out-of-bounds position: single positional indexer is out-of-bounds

--- Slicing by Position (.iloc[]) ---

Scores from position 1 to 4 (exclusive of 4):
 Bob     

In [None]:
'''
Key Points to Remember:

Label-based indexing (exam_scores['label'] or exam_scores[['label1', 'label2']]): Best when you know the specific labels you want.
Label-based slicing (exam_scores['start_label':'end_label']): Inclusive of the end_label.
Positional indexing (exam_scores.iloc[position] or exam_scores.iloc[[pos1, pos2]]): Best when you need to access by integer location, similar to Python lists.
Positional slicing (exam_scores.iloc[start_pos:end_pos]): Exclusive of the end_pos, just like Python list slicing.
Using the appropriate indexing method (.loc for labels, .iloc for positions) improves clarity and avoids potential ambiguity, especially when your index labels happen to be integers themselves.

'''
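
The last key point is easiest to see with a Series whose labels are themselves integers. This is a small sketch with a made-up Series `s_int`; the labels 10, 20, 30 are chosen so that they never coincide with the positions 0, 1, 2.

```python
import pandas as pd

# Hypothetical Series whose index labels are integers that differ from the positions.
s_int = pd.Series([85, 92, 78], index=[10, 20, 30])

by_label = s_int.loc[10]        # label 10    -> 85
by_position = s_int.iloc[0]     # position 0  -> 85

label_slice = s_int.loc[10:20]  # labels 10 and 20 (end label included)
pos_slice = s_int.iloc[0:1]     # position 0 only (end position excluded)

print(by_label, by_position)
print(label_slice)
print(pos_slice)

# 0 is a valid position but not a label, so .loc refuses it.
try:
    s_int.loc[0]
except KeyError:
    print("0 is not a label in s_int")
```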

# Difference between loc and iloc

In [None]:
'''
In Pandas, loc and iloc are two of the most fundamental and frequently used indexers for selecting data from Series and DataFrames.
They are distinct in how they interpret the "selection" you provide:

loc is primarily label-based: It selects data by the labels of rows and columns.
iloc is primarily integer-location based: It selects data by the position (integer index) of rows and columns.
Let's break them down with examples using both a Series and a DataFrame.

'''

In [None]:
'''
Pandas Series: loc vs. iloc
#===================================

We'll start with a Series where the index labels are distinct from their integer positions to highlight the difference.

'''
import pandas as pd

# Create a Series with custom, non-numeric labels
s = pd.Series([100, 200, 300, 400, 500],
              index=['apple', 'banana', 'cherry', 'date', 'elderberry'])

print("--- Original Series 's' ---")
print(s)
print("-" * 30)

# --- Using .loc (Label-based) ---
print("\n--- Using .loc (Label-based) ---")

# 1. Select a single element by label
print(f"s.loc['banana']: {s.loc['banana']}")

# 2. Select multiple elements by a list of labels (Fancy Indexing)
print(f"\ns.loc[['apple', 'date']]:\n{s.loc[['apple', 'date']]}")

# 3. Slice by label (INCLUSIVE of the end label)
# Selects elements from 'banana' up to and including 'date'
print(f"\ns.loc['banana':'date']:\n{s.loc['banana':'date']}")

# --- Using .iloc (Integer-location based) ---
print("\n--- Using .iloc (Integer-location based) ---")

# 1. Select a single element by integer position (0-indexed)
print(f"s.iloc[1]: {s.iloc[1]}") # This selects the element at position 1 (which is 'banana')

# 2. Select multiple elements by a list of integer positions (Fancy Indexing)
print(f"\ns.iloc[[0, 3]]:\n{s.iloc[[0, 3]]}") # Selects elements at positions 0 ('apple') and 3 ('date')

# 3. Slice by integer position (EXCLUSIVE of the end position, like Python lists)
# Selects elements from position 1 up to (but not including) position 4
# This selects elements at positions 1, 2, 3 ('banana', 'cherry', 'date')
print(f"\ns.iloc[1:4]:\n{s.iloc[1:4]}")



--- Original Series 's' ---
apple         100
banana        200
cherry        300
date          400
elderberry    500
dtype: int64
------------------------------

--- Using .loc (Label-based) ---
s.loc['banana']: 200

s.loc[['apple', 'date']]:
apple    100
date     400
dtype: int64

s.loc['banana':'date']:
banana    200
cherry    300
date      400
dtype: int64

--- Using .iloc (Integer-location based) ---
s.iloc[1]: 200

s.iloc[[0, 3]]:
apple    100
date     400
dtype: int64

s.iloc[1:4]:
banana    200
cherry    300
date      400
dtype: int64


In [17]:
'''
Pandas DataFrame: loc vs. iloc
#=======================================

For DataFrames, loc and iloc extend to both rows and columns. 
Their general syntax is df.loc[row_selector, column_selector] and df.iloc[row_selector, column_selector].

'''


import pandas as pd

# Create a DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [24, 27, 22, 32, 29],
    'City': ['NY', 'LA', 'SF', 'NY', 'SF'],
    'Score': [85, 92, 78, 65, 90]
}
df = pd.DataFrame(data, index=['a', 'b', 'c', 'd', 'e'])

print("--- Original DataFrame 'df' ---")
print(df)
print("-" * 30)

# --- Using .loc (Label-based for rows and columns) ---
print("\n--- Using .loc ---")

# 1. Select a single row by its label
print(f"\ndf.loc['c']:\n{df.loc['c']}")

# 2. Select specific rows and specific columns by labels
print(f"\ndf.loc[['a', 'd'], ['Name', 'Score']]:\n{df.loc[['a', 'd'], ['Name', 'Score']]}")

# 3. Slice rows by labels (inclusive) and select all columns
print(f"\ndf.loc['b':'d', :]:\n{df.loc['b':'d', :]}")

# 4. Select all rows and slice columns by labels (inclusive)
print(f"\ndf.loc[:, 'Age':'Score']:\n{df.loc[:, 'Age':'Score']}")

# 5. Boolean indexing with .loc (very common!)
# Select rows where Age > 25 and only show Name and City columns
print(f"\ndf.loc[df['Age'] > 25, ['Name', 'City']]:\n{df.loc[df['Age'] > 25, ['Name', 'City']]}")

# --- Using .iloc (Integer-location based for rows and columns) ---
print("\n--- Using .iloc ---")

# 1. Select a single row by its integer position
print(f"\ndf.iloc[2]:\n{df.iloc[2]}") # Corresponds to row with label 'c'

# 2. Select specific rows and specific columns by integer positions
print(f"\ndf.iloc[[0, 3], [0, 3]]:\n{df.iloc[[0, 3], [0, 3]]}") # (row 0, row 3) and (col 0, col 3)

# 3. Slice rows by integer positions (exclusive) and select all columns
print(f"\ndf.iloc[1:4, :]:\n{df.iloc[1:4, :]}") # Rows at pos 1, 2, 3 (labels 'b', 'c', 'd')

# 4. Select all rows and slice columns by integer positions (exclusive)
print(f"\ndf.iloc[:, 1:3]:\n{df.iloc[:, 1:3]}") # Columns at pos 1, 2 ('Age', 'City')

# 5. Using boolean indexing with .iloc (less common, usually loc is clearer for this)
# You'd typically use .loc for boolean masks as the mask is label-aligned
# Example: Select rows where Age > 25 and only show first two columns
print(f"\ndf.iloc[(df['Age'] > 25).values, :2]:\n{df.iloc[(df['Age'] > 25).values, :2]}")
# Note: (df['Age'] > 25).values converts the Series of booleans to a NumPy array of booleans,
# which .iloc can accept for row selection.

--- Original DataFrame 'df' ---
      Name  Age City  Score
a    Alice   24   NY     85
b      Bob   27   LA     92
c  Charlie   22   SF     78
d    David   32   NY     65
e      Eve   29   SF     90
------------------------------

--- Using .loc ---

df.loc['c']:
Name     Charlie
Age           22
City          SF
Score         78
Name: c, dtype: object

df.loc[['a', 'd'], ['Name', 'Score']]:
    Name  Score
a  Alice     85
d  David     65

df.loc['b':'d', :]:
      Name  Age City  Score
b      Bob   27   LA     92
c  Charlie   22   SF     78
d    David   32   NY     65

df.loc[:, 'Age':'Score']:
   Age City  Score
a   24   NY     85
b   27   LA     92
c   22   SF     78
d   32   NY     65
e   29   SF     90

df.loc[df['Age'] > 25, ['Name', 'City']]:
    Name City
b    Bob   LA
d  David   NY
e    Eve   SF

--- Using .iloc ---

df.iloc[2]:
Name     Charlie
Age           22
City          SF
Score         78
Name: c, dtype: object

df.iloc[[0, 3], [0, 3]]:
    Name  Score
a  Alice     85


# When to use loc vs. iloc:

In [19]:

'''
loc:

Always prefer loc when you know the row and column labels. It makes your code more readable and robust to changes in data order (e.g., if a new row is inserted, the integer position changes but the label stays the same).
Essential for boolean indexing (e.g., df.loc[df['column'] > value]).
iloc:

Use iloc when you need to select rows/columns purely by their positional order, regardless of their labels.
Useful in loops or when writing functions where you iterate through positions.
Crucial Distinction for Slicing:

loc slicing: The end label is inclusive. df.loc['start':'end'] includes 'end'.
iloc slicing: The end position is exclusive. df.iloc[start:end] does not include end. (This is consistent with standard Python list slicing).
Mastering loc and iloc is fundamental for effective data manipulation and analysis in Pandas.

'''
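
To make the "robust to changes in data order" point concrete, here is a minimal sketch. The `people` DataFrame is a made-up, shortened variant of the one used earlier; sorting it changes every row's position but none of the labels.

```python
import pandas as pd

# Hypothetical three-row DataFrame with string row labels.
people = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Charlie"], "Score": [85, 92, 78]},
    index=["a", "b", "c"],
)

# Reorder the rows by score: the new row order is c (78), a (85), b (92).
by_score = people.sort_values("Score")

# Label-based lookup returns the same person before and after sorting.
print(people.loc["c", "Name"])     # Charlie
print(by_score.loc["c", "Name"])   # Charlie

# Position-based lookup follows the current row order instead.
print(people.iloc[2, 0])           # Charlie
print(by_score.iloc[2, 0])         # Bob
```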

In [None]:
# Note again, when slicing:

# .loc[start:end]  -> the end label is inclusive
# .iloc[start:end] -> the end position is exclusive

# Data

In [None]:
# Data can come in different formats:
# csv (Comma-Separated Values)
# json (JavaScript Object Notation)
# xml (Extensible Markup Language)
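
As a rough sketch of how two of these formats reach Pandas, the example below parses the same two made-up records from a CSV string and from a JSON string. The in-memory strings (`csv_text`, `json_text`) stand in for real files and are assumptions for illustration. XML is left out here because `pd.read_xml` relies on an XML parser such as lxml being available.

```python
from io import StringIO

import pandas as pd

# Hypothetical in-memory data standing in for a .csv file and a .json file.
csv_text = "name,age\nAlice,30\nBob,24\n"
json_text = '[{"name": "Alice", "age": 30}, {"name": "Bob", "age": 24}]'

# StringIO wraps each string in a file-like object that the readers accept.
df_csv = pd.read_csv(StringIO(csv_text))
df_json = pd.read_json(StringIO(json_text))

print(df_csv)
print(df_json)
```

Both calls should give a two-row DataFrame with `name` and `age` columns and a default integer index.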

In [None]:
# json

'''
Let's define JSON-like datasets in Python, demonstrating how to structure them to represent two individuals with "name" and "age" fields, first as a list of dictionaries (implicitly indexed 0, 1) and then as a dictionary where "data" is a key holding that list.

'''

In [21]:
'''Method 1: List of Dictionaries (Implicit Index 0, 1)
#============================================================

This is the most common and natural way to represent a collection of structured records in JSON. Each record is a dictionary, and the overall structure is a list of these dictionaries. When you load this into Pandas, it will automatically get a default integer index (0, 1, ...).
'''

import json

# List of dictionaries, where each dictionary represents a record
json_data_list = [
    {
        "name": "Alice",
        "age": 30
    },
    {
        "name": "Bob",
        "age": 24
    }
]

print("--- JSON Data (List of Dictionaries) ---")
# Using json.dumps for pretty printing in Python, looks like a standard JSON array
print(json.dumps(json_data_list, indent=4))
print("\nThis structure implicitly has index 0 and 1 for the two records.")

--- JSON Data (List of Dictionaries) ---
[
    {
        "name": "Alice",
        "age": 30
    },
    {
        "name": "Bob",
        "age": 24
    }
]

This structure implicitly has index 0 and 1 for the two records.


In [22]:
'''

Method 2: Dictionary with "data" as Index (Key)
#==================================================

Here, we wrap the list of dictionaries inside another dictionary, where "data" is the key pointing to that list. This is useful if you have metadata alongside your main data array in the JSON.
'''

import json

# Dictionary where a key (e.g., "data") holds the list of dictionaries
json_data_with_key = {
    "metadata": {
        "source": "example_data",
        "version": "1.0"
    },
    "data": [
        {
            "name": "Charlie",
            "age": 35
        },
        {
            "name": "Diana",
            "age": 28
        }
    ]
}

print("\n--- JSON Data (Dictionary with 'data' key) ---")
# Using json.dumps for pretty printing
print(json.dumps(json_data_with_key, indent=4))
print("\nHere, the list of records is accessed via the 'data' key.")


--- JSON Data (Dictionary with 'data' key) ---
{
    "metadata": {
        "source": "example_data",
        "version": "1.0"
    },
    "data": [
        {
            "name": "Charlie",
            "age": 35
        },
        {
            "name": "Diana",
            "age": 28
        }
    ]
}

Here, the list of records is accessed via the 'data' key.


In [None]:
'''
Why these structures are common for JSON data
#===============================================


List of Dictionaries (Method 1): Directly maps to a common structure where each object in a JSON array represents a row or record. 
This is what pd.read_json() often expects by default for simple datasets, or when you create a DataFrame from a list of dicts.
Dictionary with a Key (Method 2): Often seen in APIs or larger data files where the core data is nested under a specific key
(like "data", "results", "items"), and the top-level dictionary might also contain metadata, pagination info, or other related details. 
When reading this into Pandas, you'd typically specify the record path (e.g., pd.json_normalize(json_data_with_key, 'data')).
Both of these structures are valid ways to represent your "name" and "age" data in a JSON-like format within Python. 
The choice depends on the broader context of your data and how you intend to use or exchange it.

'''
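
Both structures can be turned into DataFrames along the lines described above. The sketch below redefines the two example payloads under the hypothetical names `records` and `payload` so that it runs on its own; pulling the `source` field in through `meta` is an optional extra, not something the notes above require.

```python
import pandas as pd

# Method 1 shape: a list of dictionaries, one per record.
records = [{"name": "Alice", "age": 30}, {"name": "Bob", "age": 24}]

# Method 2 shape: the records are nested under a "data" key next to metadata.
payload = {
    "metadata": {"source": "example_data", "version": "1.0"},
    "data": [{"name": "Charlie", "age": 35}, {"name": "Diana", "age": 28}],
}

# A list of dicts maps directly onto rows; the index defaults to 0, 1, ...
df_records = pd.DataFrame(records)

# record_path tells json_normalize where the list of records lives.
df_nested = pd.json_normalize(payload, record_path="data")

# meta copies a top-level field onto every row (the column is named "metadata.source").
df_with_meta = pd.json_normalize(
    payload, record_path="data", meta=[["metadata", "source"]]
)

print(df_records)
print(df_nested)
print(df_with_meta)
```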

In [23]:
# Accessing csv dataset 

import pandas as pd

# Define the URL for the Titanic dataset
titanic_dataset_github_url = "https://raw.githubusercontent.com/datasciencedojo/datasets/refs/heads/master/titanic.csv"

# Read the CSV file into a Pandas DataFrame
try:
    df_titanic = pd.read_csv(titanic_dataset_github_url)
    print("CSV file loaded successfully!")

    # Display the first 5 rows of the DataFrame (head)
    print("\n--- DataFrame Head (First 5 Rows) ---")
    print(df_titanic.head())

    # Display the last 5 rows of the DataFrame (tail)
    print("\n--- DataFrame Tail (Last 5 Rows) ---")
    print(df_titanic.tail())

except Exception as e:
    print(f"Error loading CSV file: {e}")
    print("Please check the URL or your internet connection.")

CSV file loaded successfully!

--- DataFrame Head (First 5 Rows) ---
   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C1

In [25]:
df_titanic['Survived']

0      0
1      1
2      1
3      1
4      0
      ..
886    0
887    1
888    0
889    1
890    0
Name: Survived, Length: 891, dtype: int64

In [28]:
df_titanic[['PassengerId', 'Survived']]

     PassengerId  Survived
0              1         0
1              2         1
2              3         1
3              4         1
4              5         0
..           ...       ...
886          887         0
887          888         1
888          889         0
889          890         1
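
The two selections above return different kinds of objects: single brackets with one column name give a Series, while double brackets (a list of names) give a DataFrame. A minimal sketch, using a tiny made-up `passengers` table rather than the downloaded Titanic data:

```python
import pandas as pd

# Hypothetical miniature version of the Titanic columns used above.
passengers = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
})

one_col = passengers["Survived"]                     # Series
two_cols = passengers[["PassengerId", "Survived"]]   # DataFrame with two columns
one_col_frame = passengers[["Survived"]]             # DataFrame with a single column

print(type(one_col))
print(type(two_cols))
print(type(one_col_frame))
```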


In [31]:
import pandas as pd

# Define the URL for the Titanic dataset
titanic_dataset_github_url = "https://raw.githubusercontent.com/datasciencedojo/datasets/refs/heads/master/titanic.csv"

# Read the CSV file into a Pandas DataFrame
try:
    df_titanic = pd.read_csv(titanic_dataset_github_url)
    print("Titanic dataset loaded successfully!")
except Exception as e:
    print(f"Error loading CSV file: {e}")
    print("Please check the URL or your internet connection.")

# --- Displaying with .loc (Label-based) ---
print("\n--- Displaying with .loc (Label-based) ---")

# .loc[row_label(s), column_label(s)]

# 1. Select specific rows by their index labels and specific columns by their names
# Let's pick rows 0, 2, and 5, and columns 'Name', 'Age', 'Survived'.
# Note: For this dataset, the default index labels are integers (0, 1, 2, ...).
# So, using loc with integer labels is possible, but it's still label-based.
print("\nSelected rows 0, 2, 5 and columns 'Name', 'Age', 'Survived' using .loc:")
print(df_titanic.loc[[0, 2, 5], ['Name', 'Age', 'Survived']])

# 2. Select a slice of rows and all columns
# Remember, .loc slicing is INCLUSIVE of the end label.
print("\nRows from index 10 to 14 (inclusive) and all columns using .loc:")
print(df_titanic.loc[10:14, :])

# 3. Select all rows for a specific set of columns
print("\nAll rows for 'Name', 'Sex', 'Fare' columns using .loc:")
print(df_titanic.loc[:, ['Name', 'Sex', 'Fare']].head()) # Showing head for brevity

# 4. Conditional selection with .loc (very common and powerful!)
# Select passengers who survived (Survived = 1) and show their 'Name', 'Age', 'Fare'.
print("\nPassengers who survived (first 10) and their Name, Age, Fare using .loc:")
print(df_titanic.loc[df_titanic['Survived'] == 1, ['Name', 'Age', 'Fare']].head(10))

# --- Displaying with .iloc (Integer-location based) ---
print("\n--- Displaying with .iloc (Integer-location based) ---")

# .iloc[row_position(s), column_position(s)]
# Remember, .iloc slicing is EXCLUSIVE of the end position.

# 1. Select specific rows by their integer positions and specific columns by their integer positions
# Let's select rows at positions 0, 2, 5 and columns at positions 3 (Name), 5 (Age), 1 (Survived).
print("\nSelected rows at pos 0, 2, 5 and cols at pos 3, 5, 1 using .iloc:")
print(df_titanic.iloc[[0, 2, 5], [3, 5, 1]])

# 2. Select a slice of rows and a slice of columns by their integer positions
# Rows from position 10 to 14 (exclusive of 14, so positions 10, 11, 12, 13)
# Columns from position 0 to 4 (exclusive of 4, so positions 0, 1, 2, 3)
print("\nRows from pos 10-13, Cols from pos 0-3 using .iloc:")
print(df_titanic.iloc[10:14, 0:4])

# 3. Select all rows and specific columns by their integer positions
# Columns 'Name' (pos 3), 'Sex' (pos 4), 'Fare' (pos 9; pos 8 is 'Ticket')
print("\nAll rows for columns at pos 3, 4, 9 using .iloc:")
print(df_titanic.iloc[:, [3, 4, 9]].head()) # Showing head for brevity

# 4. Select the very first row and all columns
print("\nFirst row using .iloc[0, :]:")
print(df_titanic.iloc[0, :])

# 5. Select the last row and all columns
print("\nLast row using .iloc[-1, :]:")
print(df_titanic.iloc[-1, :])

Titanic dataset loaded successfully!

--- Displaying with .loc (Label-based) ---

Selected rows 0, 2, 5 and columns 'Name', 'Age', 'Survived' using .loc:
                      Name   Age  Survived
0  Braund, Mr. Owen Harris  22.0         0
2   Heikkinen, Miss. Laina  26.0         1
5         Moran, Mr. James   NaN         0

Rows from index 10 to 14 (inclusive) and all columns using .loc:
    PassengerId  Survived  Pclass                                  Name  \
10           11         1       3       Sandstrom, Miss. Marguerite Rut   
11           12         1       1              Bonnell, Miss. Elizabeth   
12           13         0       3        Saundercock, Mr. William Henry   
13           14         0       3           Andersson, Mr. Anders Johan   
14           15         0       3  Vestrom, Miss. Hulda Amanda Adolfina   

       Sex   Age  SibSp  Parch     Ticket     Fare Cabin Embarked  
10  female   4.0      1      1    PP 9549  16.7000    G6        S  
11  female  58.0     
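
Because the Titanic index is the default 0, 1, 2, ..., labels and positions coincide, which hides the difference between .loc and .iloc. The sketch below uses a small made-up DataFrame (`df_demo` and its values are illustrative assumptions, not part of the dataset) whose index labels differ from the row positions, so the inclusive/exclusive slicing rules become visible.

```python
import pandas as pd

# Hypothetical toy data (not from the Titanic dataset); the index labels are
# deliberately different from the row positions.
df_demo = pd.DataFrame(
    {"name": ["Ann", "Bob", "Cid", "Dee"], "age": [31, 25, 47, 19]},
    index=[10, 20, 30, 40],
)

# .loc slices by LABEL and includes the end label: rows labelled 10, 20 and 30.
by_label = df_demo.loc[10:30, "name"]

# .iloc slices by POSITION and excludes the end position: rows at positions 0 and 1.
by_position = df_demo.iloc[0:2, 0]

print(by_label.tolist())     # ['Ann', 'Bob', 'Cid']
print(by_position.tolist())  # ['Ann', 'Bob']
```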

# Conditional Indexing and Slicing with a DataFrame

In [None]:
'''
Conditional slicing and indexing in Pandas lets you select rows (and sometimes columns) from a
DataFrame or Series based on conditions rather than on labels or positions.
It is usually called boolean indexing or boolean masking.

The core idea is:

1. Create a boolean Series (or array) in which True marks the rows (or elements) that meet the
   condition and False marks those that don't.
2. Use this boolean Series to select the corresponding rows (or elements) from the DataFrame or Series.

Let's illustrate with the df_titanic DataFrame.

'''
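
Before applying this to the full dataset, the two steps can be seen on a tiny example. This is a minimal sketch; `df_demo` and its values are made up for illustration and are not part of the Titanic data.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the masking steps.
df_demo = pd.DataFrame({"name": ["Ann", "Bob", "Cid"], "age": [31, 25, 47]})

# Step 1: the comparison yields a boolean Series aligned with the rows.
mask = df_demo["age"] > 30
print(mask.tolist())  # [True, False, True]

# Step 2: indexing with the mask keeps only the rows where it is True.
selected = df_demo[mask]
print(selected["name"].tolist())  # ['Ann', 'Cid']

# Masks combine element-wise with & (AND), | (OR) and ~ (NOT).
# Inline comparisons need parentheses because | and & bind more tightly than < or ==.
young_or_ann = df_demo[(df_demo["age"] < 30) | (df_demo["name"] == "Ann")]
print(young_or_ann["name"].tolist())  # ['Ann', 'Bob']
```

The mask keeps the original index, so the selected rows retain their original row labels (0 and 2 here) rather than being renumbered.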

In [32]:
import pandas as pd

# Define the URL for the Titanic dataset
titanic_dataset_github_url = "https://raw.githubusercontent.com/datasciencedojo/datasets/refs/heads/master/titanic.csv"

# Read the CSV file into a Pandas DataFrame
try:
    df_titanic = pd.read_csv(titanic_dataset_github_url)
    print("Titanic dataset loaded successfully!")
    print("\nOriginal DataFrame Head:")
    print(df_titanic.head())
    print("-" * 50)
except Exception as e:
    print(f"Error loading CSV file: {e}")
    print("Please check the URL or your internet connection.")
    raise  # Stop here: everything below needs df_titanic

# --- Conditional Indexing / Slicing ---

# 1. Select rows where 'Age' is greater than 30
print("\n--- Passengers older than 30 (All columns) ---")
condition_age = df_titanic['Age'] > 30
# The 'condition_age' variable is a Series of True/False values:
# print(condition_age.head())
# Output:
# 0    False
# 1     True
# 2    False
# 3     True
# 4     True
# Name: Age, dtype: bool

# Use the boolean Series to filter rows
older_passengers = df_titanic[condition_age]
# Or more concisely: older_passengers = df_titanic[df_titanic['Age'] > 30]
print(older_passengers.head())
print(f"Number of passengers older than 30: {len(older_passengers)}")

# 2. Select rows where 'Survived' is 1 (meaning survived) AND 'Pclass' is 1 (first class)
print("\n--- First-class passengers who survived ---")
condition_survived = df_titanic['Survived'] == 1
condition_pclass = df_titanic['Pclass'] == 1

# Combine conditions using '&' (element-wise AND); Python's 'and' does not work on a Series.
# When the comparisons are written inline, wrap each one in parentheses,
# because '&' binds more tightly than '=='.
first_class_survivors = df_titanic[condition_survived & condition_pclass]
# Or: first_class_survivors = df_titanic[(df_titanic['Survived'] == 1) & (df_titanic['Pclass'] == 1)]
print(first_class_survivors.head())
print(f"Number of first-class survivors: {len(first_class_survivors)}")

# 3. Select rows where 'Sex' is 'female' OR 'Fare' is greater than 100
print("\n--- Female passengers OR passengers with fare > 100 ---")
condition_female = df_titanic['Sex'] == 'female'
condition_fare = df_titanic['Fare'] > 100

# Combine conditions using '|' (logical OR)
females_or_high_fare = df_titanic[condition_female | condition_fare]
print(females_or_high_fare.head())
print(f"Number of such passengers: {len(females_or_high_fare)}")

# 4. Using .loc for conditional selection with specific columns
# A single .loc call filters the rows and picks the columns in one step,
# which avoids chained indexing such as df[mask][cols].
print("\n--- Using .loc for conditional selection (Survived & Sex) ---")
# Select name, age, and fare for male survivors
male_survivors_data = df_titanic.loc[(df_titanic['Survived'] == 1) & (df_titanic['Sex'] == 'male'),
                                     ['Name', 'Age', 'Fare']]
print(male_survivors_data.head())
print(f"Number of male survivors: {len(male_survivors_data)}")

# 5. Using .isin() for multiple discrete values
print("\n--- Passengers from Pclass 1 or 3 ---")
# Select passengers from Pclass 1 or 3
pclass_filter = df_titanic['Pclass'].isin([1, 3])
pclass_1_or_3_passengers = df_titanic[pclass_filter]
print(pclass_1_or_3_passengers.head())
print(f"Number of passengers in Pclass 1 or 3: {len(pclass_1_or_3_passengers)}")

# 6. Using .between() for numerical ranges
print("\n--- Passengers with Age between 18 and 30 (inclusive) ---")
age_range_filter = df_titanic['Age'].between(18, 30, inclusive='both')
age_18_to_30_passengers = df_titanic[age_range_filter]
print(age_18_to_30_passengers.head())
print(f"Number of passengers aged 18-30: {len(age_18_to_30_passengers)}")

# 7. Selecting rows with missing values (e.g., missing Age)
print("\n--- Passengers with missing Age ---")
missing_age_passengers = df_titanic[df_titanic['Age'].isnull()]
print(missing_age_passengers.head())
print(f"Number of passengers with missing Age: {len(missing_age_passengers)}")

# 8. Selecting rows where a string column contains a specific substring
print("\n--- Passengers whose name contains 'Mr.' (case-insensitive) ---")
# The .str accessor exposes string methods on a Series.
# The pattern is a regular expression, so the dot is escaped; the raw string (r'...')
# keeps Python from warning about '\.' as an invalid escape sequence.
# na=False treats missing names as non-matches.
mr_passengers = df_titanic[df_titanic['Name'].str.contains(r'Mr\.', na=False, case=False)]
print(mr_passengers.head())
print(f"Number of passengers with 'Mr.' in name: {len(mr_passengers)}")


Titanic dataset loaded successfully!

Original DataFrame Head:
   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123    