# Food Delivery Data Exploration and Analysis 1

# Agenda

1. **Introduction to the Dataset & Business Context**
2. **Introduction to Data Analysis & Visualization (DAV)**
3. **Python Lists vs NumPy Arrays**
   - <span style="color: violet;"> Why Use NumPy Arrays? </span>
4. **Dimensions & Shape**
   - Understanding Dimensions & Shape
   - `np.arange()`
5. **Type Conversion in NumPy Arrays**
6. **Indexing & Slicing**

---

# What is Machine Learning?

## <span style="color: skyblue;"> The Core Concept </span>
Machine Learning is the ability of a machine to learn from experience, history, and patterns, similar to how humans learn.

## <span style="color: skyblue;"> Example: The Bank Loan Story </span>
- **Scenario:** A bank needs to give a ₹20 lakh loan to either Candidate 1 (C1) or Candidate 2 (C2).
- **Data Point 1:** C1 earns ₹5 lakh/month; C2 earns ₹5 lakh/year.
    - <span style="color: violet;"> Which one will you choose? </span>
    - Most choose C1 based on capacity.
- **Data Point 2:** Reveal that C1 is Vijay Mallya.

![Source: The Economic Times](https://hackmd.io/_uploads/rkLVHltVZe.png)

- Decisions immediately change.
- **Lesson:** Decisions are based on historical patterns (e.g., fraud history) rather than just current capacity.
- Machine Learning models help banks detect fraud by learning patterns from millions of past transactions and flagging anything unusual for future transactions.

## <span style="color: skyblue;"> Example: Salary Prediction </span>
- **Data:**
    - 1 yr experience = ₹10k
    - 2 yr experience = ₹20k
    - 3 yr experience = ₹30k

<span style="background-color: red;"> **[Ask Learners]:** </span>  
> <span style="color: violet;"> Question: What is the salary for 6 years of experience? </span>  
> Answer: ₹60k. (We found a pattern!)

- **Pattern:** Salary = Years of Experience × ₹10k.
- Humans can spot only simple patterns from a few records (here, just 3), but machines spot these patterns across 10 million records.
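
As a rough illustration of how a machine could recover the same pattern, the sketch below fits a straight line to the three records. The use of `np.polyfit` is an assumption made for illustration only; the session does not prescribe a particular method.

```python
import numpy as np

# The three records from the example above (salary in rupees).
years = np.array([1, 2, 3])
salary = np.array([10_000, 20_000, 30_000])

# Fit a straight line: salary ~ slope * years + intercept.
slope, intercept = np.polyfit(years, salary, deg=1)

# Use the learned pattern to predict the salary for 6 years of experience.
predicted = slope * 6 + intercept
print(round(predicted))   # 60000
```

With only three points the line is easy to spot by eye; the same call works unchanged on far larger arrays.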


## <span style="color: skyblue;"> From Learning Patterns → Understanding Data </span>


In the first section, we saw that Machine Learning works by learning patterns from historical data—just like how we chose a loan candidate or predicted salary using past examples.

But this raises an important question:

**Before a machine can learn patterns, how do we understand the data itself?**

That’s where Exploratory Data Analysis (EDA) comes in.

---


# Exploratory Data Analysis and DAV

## <span style="color: violet;"> What is EDA? </span>
Exploratory Data Analysis (EDA) is the process of exploring and analyzing every feature in a dataset before making a decision.

- **Movie Example:** Before watching a movie, you check ratings, reviews, actors, director, genre, and price. You don't decide based on just one factor.
- **Industry Example:** Zomato launched "Pure Veg" deliveries based on analysis of what people like, not random choice. They analyze comments, reviews, and average costs for two.

These are the kinds of questions we aim to address throughout the DAV module:

- <img src=https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/100/467/original/highest-rated_restaurants.png?1734428912 width=600>
- <img src=https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/100/468/original/average_cost_for_two_people.png?1734428927 width=600>


## <span style="color: violet;"> Why is EDA Necessary? </span>
Machines are "dumb" without context.
1. **Removing Irrelevant Patterns:** A machine might think an increasing "Serial Number" causes "Salary" to increase. Humans know to remove the Serial Number column.
2. **Data Cleaning:** Real-world data is "corrupt."
    - "10k" needs to be converted to "10000" to perform math.
    - Ratings like "4.1/5" must be cleaned to "4.1" to calculate averages.
    - Costs stored in quotes (strings) must be converted to numbers.
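
The cleaning steps listed above can be sketched in plain Python. The helper names `parse_amount` and `parse_rating` are hypothetical and exist only for this illustration; real datasets usually need more cases than these.

```python
def parse_amount(text):
    """Convert values such as '10k' or "'800.0'" to a float."""
    # Remove surrounding whitespace and any stray quote characters.
    cleaned = text.strip().strip("'\"").lower()
    if cleaned.endswith("k"):
        # '10k' means 10 thousand.
        return float(cleaned[:-1]) * 1000
    return float(cleaned)


def parse_rating(text):
    """Convert a rating such as '4.1/5' to the float 4.1."""
    return float(text.split("/")[0])


print(parse_amount("10k"))       # 10000.0
print(parse_rating("4.1/5"))     # 4.1
print(parse_amount("'800.0'"))   # 800.0
```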

## <span style="color: violet;"> What is DAV? </span>
DAV stands for **Data Analysis and Visualization**. It involves the tools used to understand, process, and present data meaningfully:
- **NumPy:** Used for numerical data (integers, floats).
- **Pandas:** Used for text, reviews, grouping, and tabular data.
- **Matplotlib & Seaborn:** Used for creating charts and graphs.

---

# Colab and PyPI

## <span style="color: skyblue;"> The Python Ecosystem </span>
- **PyPI (Python Package Index):** Think of this as the "App Store" or "Play Store" for Python.
- There are approximately **700,000 projects** available on PyPI.
- In this module, we focus on only **4 libraries** (NumPy, Pandas, Matplotlib, Seaborn).
    - Imagine how important they are!
    - These 4 form the "Rock Solid Base" for all advanced topics like Deep Learning, Generative AI, and Agentic AI.


## <span style="color: skyblue;"> The Coding Environment </span>
To implement the concepts of this module, we use **Google Colab**.
- **Definition:** A cloud-based coding platform powered by Google.
- **Benefits:**
  - Used in the industry for Proof of Concepts (POCs) and demos.
  - Provides free access to expensive hardware like **GPUs** and **TPUs**.
  - Inbuilt support for libraries like NumPy, Pandas, and Matplotlib.

The steps below show how to set up a Colab notebook for writing Python code and learning the concepts.

<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/171/575/original/1.jpg?1766465560" width="700">

* Click 'New notebook' to create a new file.


## <span style="color: skyblue;"> Import Syntax and Aliases </span>
Instead of typing the full library name repeatedly, we use nicknames (aliases):
- **Command:** `import numpy as np`
- `np` is the standard industry nickname. You could technically use any name (like `chintu`), but sticking to the standard makes your code readable for others.

## <span style="color: skyblue;"> Python Basics </span>

Python is a very simple and easy-to-understand programming language.
Let's start with the basics of Python:

1. Printing a simple statement in Python:


#### Code:


In [None]:
print("Hello AIML batch Dec(2)")

Hello AIML batch Dec(2)


> <span style="color: red;"> **Note:** </span> In Colab, each cell can be executed by clicking the `Play` button or by pressing `Shift + Enter`.

2. Python is dynamically typed. You do not declare a variable's data type; Python infers it from the value you assign.

In [None]:
x = 3
y = "Hello"
print(x)
print(y)
print(type(x))
print(type(y))

3
Hello
<class 'int'>
<class 'str'>


3. Creating a list in Python.

#### Code:

In [None]:
students = [1, 2, 3, 4, 5]
print(type(students))

<class 'list'>


4. A Python list can hold values of different data types without raising an error, so a list is not tied to a single data type.

#### Code:

In [None]:
x = [1, 'python', 'Sri', True]
print(x)

[1, 'python', 'Sri', True]


> Similarly, there are a few more data types, such as float, bool, etc.


# Introduction to NumPy

## <span style="color: violet;"> What is NumPy? </span>
- Stands for **Numerical Python**.
- Deals with numbers (integers and floats).
- **Import convention:** `import numpy as np`

## <span style="color: skyblue;"> NumPy Array vs. Python List </span>
Python lists are not designed for mathematical operations.
- **List behavior:** `[4, 5, 6] * 2` results in `[4, 5, 6, 4, 5, 6]` (replication), not element-wise multiplication. NumPy arrays handle this differently:

1. Creating a `numpy` array

#### Code:


In [None]:
import numpy as np
votes = np.array([ 775,  787,  918,   88,  166,  286, 2556,  324,  504,  402])
costs = np.array(["'800.0'" ,"'800.0'", "'800.0'", "'300.0'", "'600.0'", "'600.0'", "'600.0'", "'700.0'" ,"'550.0'", "'500.0'"])

print("Votes (Array):", votes)
print("Costs (Array):", costs)
print(type(votes))

Votes (Array): [ 775  787  918   88  166  286 2556  324  504  402]
Costs (Array): ["'800.0'" "'800.0'" "'800.0'" "'300.0'" "'600.0'" "'600.0'" "'600.0'"
 "'700.0'" "'550.0'" "'500.0'"]
<class 'numpy.ndarray'>



2. Multiplying the array by 2 (NumPy applies the operation to every element at once, which is called vectorization)


In [None]:
votes * 2

array([1550, 1574, 1836,  176,  332,  572, 5112,  648, 1008,  804])
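
For contrast with the list behaviour described earlier, here is a small side-by-side sketch using the same `[4, 5, 6]` values. The variable names are illustrative only.

```python
import numpy as np

numbers = [4, 5, 6]

replicated = numbers * 2                  # list: repeats the elements
doubled_loop = [n * 2 for n in numbers]   # list: needs an explicit loop
doubled_np = np.array(numbers) * 2        # array: element-wise multiplication

print(replicated)     # [4, 5, 6, 4, 5, 6]
print(doubled_loop)   # [8, 10, 12]
print(doubled_np)     # [ 8 10 12]
```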

## <span style="color: violet;"> Why is NumPy Faster? </span>

<span style="background-color: red;"> **[Note to Instructor]:** </span>  
> **Use the "Tea Room" analogy:** Room 1 (Tea Ingredients are scattered) vs Room 2 (Tea Ingredients kept in order). Room 2 is faster for tea making.


1. **Contiguous Memory Allocation:**
   - In a **List**, elements are stored in scattered memory locations. The list stores the addresses of these elements.
   - In **NumPy**, elements are stored one after the other in memory.
2. **Homogeneous Data Types:**
   - NumPy arrays store only one type of data at a time, which optimizes computation.

<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/063/995/original/download.png?1706870327" width=700 height=175>

<span style="color: orange;"> Doubts by Learner: </span>  
> <span style="color: violet;"> Question: Does NumPy store the address of the element or the actual value in its contiguous block? </span>  
> Answer: It stores the actual value.
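
One way to see the speed difference for yourself is to time the same doubling operation on a list and on an array. This is only a rough sketch: the array size and repeat count are arbitrary choices, and the exact timings depend on your machine.

```python
import timeit

import numpy as np

n = 200_000
py_list = list(range(n))
np_array = np.arange(n)

# Time doubling every element, repeated 5 times for each approach.
list_time = timeit.timeit(lambda: [x * 2 for x in py_list], number=5)
array_time = timeit.timeit(lambda: np_array * 2, number=5)

print(f"list comprehension: {list_time:.4f} s")
print(f"NumPy array:        {array_time:.4f} s")
```

On most machines the NumPy version finishes noticeably sooner, though the gap varies with array size and hardware.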

## <span style="color: skyblue;"> Homogeneity and Priority </span>
If you mix data types in a NumPy array, it converts everything to the highest priority type:
**String > Float > Integer > Boolean**
- `[1, 2, 3.5]` → All become Float.
- `[1, "Akash", 3.5]` → All become String.
- `[True, 6]` → Boolean becomes 1 (True) or 0 (False).
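
The three cases above can be checked directly by inspecting each array's `dtype`. The variable names are illustrative, and the exact dtype names printed (for example `int64` versus `int32`) can vary by platform.

```python
import numpy as np

int_float = np.array([1, 2, 3.5])            # integers are promoted to float
with_string = np.array([1, "Akash", 3.5])    # everything becomes a string
bool_int = np.array([True, 6])               # True is stored as 1

print(int_float, int_float.dtype)        # [1.  2.  3.5] and a float dtype
print(with_string, with_string.dtype)    # all strings, a Unicode dtype
print(bool_int, bool_int.dtype)          # [1 6] and an integer dtype
```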

<font color="magenta">**NumPy**</font> is designed for:
- **Speed & Performance**: Vectorized operations run in optimized C code under the hood.
- **Memory Efficiency**: Contiguous data storage means less overhead, better cache utilization.
- **Vectorized Operations**: Apply an operation to all elements without explicit loops.
- **Faster Bulk Operations**: A whole-array operation still visits every element, but it runs in compiled code, so it is usually much faster than an equivalent pure Python loop.

In contrast, <font color="red">**Python lists**</font>:
- Are flexible but slower for large numeric computations.
- Lack built-in methods for fast vectorized math.

When NumPy converts multiple types into one (priority), you may see specific data type codes in the output:
- **Example:** `dtype='<U32'` or `'<U18'`.
- **Meaning:** The 'U' stands for **Unicode**, the number is the maximum string length in characters, and the `<` marks little-endian byte order. This indicates that the array has been converted to a **String** format because a string was present in the data.
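
A small sketch of how to read these codes from an array of names. The sample names are arbitrary, and the leading `<` shown in the comment assumes a little-endian machine, which most are.

```python
import numpy as np

names = np.array(["Sri", "Akash"])

print(names.dtype)            # <U5 on little-endian machines: longest name has 5 characters
print(names.dtype.kind)       # U
print(names.dtype.itemsize)   # 20 (5 characters x 4 bytes each)
```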


# Array Dimensions and Shape

<span style="color: red;"> **NOTE:** </span> Here we use a smaller dataset to see how the operations actually work, and then move on to a larger dataset.

## <span style="color: skyblue;"> Dimensions & Shape </span>

### Understanding Dimensions & Shape

**NumPy arrays** can represent data in multiple dimensions:
- **1D arrays**: Like a simple list of votes.
- **2D arrays**: Think rows (restaurants) and columns (attributes).
- **nD arrays**: Higher dimensions for more complex data.

We use:
- `.shape` to see (rows, columns)
- `.ndim` to see how many dimensions
- `.size` to count total elements

Let’s inspect the dimensions of our `votes` and `costs`.

    
**Code:**


In [None]:
print("Votes array shape:", votes.shape)
print("Votes array dimensions:", votes.ndim)
print("Votes array size:", votes.size)

print("Costs array shape:", costs.shape)
print("Costs array dimensions:", costs.ndim)
print("Costs array size:", costs.size)


Votes array shape: (10,)
Votes array dimensions: 1
Votes array size: 10
Costs array shape: (10,)
Costs array dimensions: 1
Costs array size: 10


- The shape `(10,)` means each array is a 1D vector with 10 elements.

**The Shortcut Trick:** Count the number of square brackets at the start of the array to determine the dimension.
- `[` = 1D
- `[[` = 2D
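
The bracket-counting trick can be confirmed against `.ndim`. The arrays below are made-up examples, not part of the restaurant dataset.

```python
import numpy as np

one_d = np.array([1, 2, 3])
two_d = np.array([[1, 2, 3], [4, 5, 6]])

print(one_d.ndim, one_d.shape)   # 1 (3,)
print(two_d.ndim, two_d.shape)   # 2 (2, 3)

# The printed form of the 2D array starts with two opening brackets.
print(two_d)
```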

### Now, let’s create a **2D array** example using a small portion of `votes` and `costs`.

#### Sample Data:

#### Code:


In [None]:
# Take first 5 elements of votes
subset_votes = votes[:5]
subset_costs = costs[:5]

# Create a 2D array: 5 rows, 2 columns (each row: [vote_count, cost])
two_d_data = np.array([
    subset_votes,
    subset_costs
]).T  # transpose so that each row corresponds to a single restaurant

print("2D Array:\n", two_d_data)
print("Shape:", two_d_data.shape)
print("Dimensions:", two_d_data.ndim)
print("Size:", two_d_data.size)

2D Array:
 [['775' "'800.0'"]
 ['787' "'800.0'"]
 ['918' "'800.0'"]
 ['88' "'300.0'"]
 ['166' "'600.0'"]]
Shape: (5, 2)
Dimensions: 2
Size: 10


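We now have a 2D array where each row is a restaurant (limited sample), and columns represent different attributes (votes, cost). Note that the vote counts now appear as strings (`'775'`): because the costs are strings, NumPy converted every value to the highest-priority type.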

<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/064/921/original/download_%284%29.png?1707852012" width="700">
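
To keep such a 2D array numeric, the quoted cost strings can be converted before combining the columns. The sketch below is one possible approach, assuming `np.char.strip` followed by `astype(float)`; it repeats the first five values so that it runs on its own.

```python
import numpy as np

votes = np.array([775, 787, 918, 88, 166])
costs = np.array(["'800.0'", "'800.0'", "'800.0'", "'300.0'", "'600.0'"])

# Strip the embedded single quotes, then convert the strings to floats.
clean_costs = np.char.strip(costs, "'").astype(float)
print(clean_costs)          # [800. 800. 800. 300. 600.]
print(clean_costs.mean())   # 660.0

# Stack the two 1D arrays as columns: one row per restaurant.
numeric_data = np.column_stack((votes, clean_costs))
print(numeric_data.shape)   # (5, 2)
print(numeric_data.dtype)   # float64
```

Because both columns are now numbers, the combined array is promoted to float rather than string.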

We can even imagine a 3D array, but we’ll keep it simple for now. The key idea: dimensions define how data is structured for analysis.
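
For the curious, a 3D array can be sketched with `np.arange()` and `reshape`. The sizes here (2 blocks of 3 rows and 4 columns) are arbitrary and chosen only for illustration.

```python
import numpy as np

# np.arange(24) produces the integers 0 to 23 as a 1D array.
flat = np.arange(24)

# Reshape into 2 blocks, each with 3 rows and 4 columns.
cube = flat.reshape(2, 3, 4)

print(flat.shape)   # (24,)
print(cube.ndim)    # 3
print(cube.shape)   # (2, 3, 4)
print(cube.size)    # 24
```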

---

# Quiz

If `arr = np.array([1e20], dtype=np.float64)` and we do `arr.astype(np.int32)`, the result is:
# Choices
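- [ ] 2147483647
- [ ] OverflowError
- [ ] 1e20 rounded to nearest int
- [x] Implementation-defined but within int32 range

Explanation:
NumPy neither raises an error nor guarantees clipping here. Casting a float that lies outside the target integer range is undefined in the underlying conversion, so the value you get depends on the platform and NumPy version (some return `-2147483648`, others `2147483647`), and recent NumPy versions also emit a `RuntimeWarning`. The result is always a valid `int32`, but it should not be relied upon.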
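
Whatever a bare `astype` does with out-of-range values, clipping first makes the boundary behaviour explicit. The sketch below is one way to do that, assuming `np.iinfo` for the limits and `np.clip` for the saturation; the sample values are made up.

```python
import numpy as np

arr = np.array([1e20, -1e20, 42.7], dtype=np.float64)

# Look up the smallest and largest values an int32 can hold.
limits = np.iinfo(np.int32)

# Clip to the int32 range first, then convert.
clipped = np.clip(arr, limits.min, limits.max)
safe = clipped.astype(np.int32)

print(safe)   # [ 2147483647 -2147483648          42]
```

Note that `astype` truncates the fractional part, so `42.7` becomes `42`.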

# Indexing and Slicing

## <span style="color: skyblue;"> Indexing </span>
Accessing a single element based on its position (0-based).

Examples we will work through in this section:
1. First 5 votes.
2. A slice of the first 10 costs.
3. For our 2D array, select the first 3 rows.

- Types of Indexing:
    - **Positive Indexing:** `0, 1, 2, 3...`
    - **Negative Indexing:** `-1` is the last element, `-2` is the second last.
- **Analogy:**
    - We can sort the food items on Zomato.
    - The items with the most orders are placed first.
    - <span style="background-color: red;"> **[Note to Instructor]:**</span> Show the demo on their website.
- **2D Indexing Mantra:** **Row, Column (R, C)**.
  - *Example:* `arr[1, 2]` fetches the element at row index 1 and column index 2, i.e. the second row and third column, because counting starts at 0.
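
A short sketch of positive, negative, and 2D indexing on a made-up 3 x 3 array. The values are chosen so that each number encodes its row and column.

```python
import numpy as np

arr = np.array([
    [10, 11, 12],
    [20, 21, 22],
    [30, 31, 32],
])

print(arr[1, 2])     # 22: row index 1, column index 2
print(arr[0, 0])     # 10: first row, first column
print(arr[-1, -1])   # 32: last row, last column
print(arr[-2, 0])    # 20: second-last row, first column
```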

You can also pass a list of indices to a NumPy array; the indices can repeat and appear in any order.


In [None]:
votes[[2,3,4,1,2,2]]

array([918,  88, 166, 787, 918, 918])

## <span style="color: skyblue;"> Slicing </span>
Accessing a "slice" or a small part of a larger dataset.
- **Syntax:** `array[start : end]`
- **The Race Analogy:** Python starts at the `start` point but stops exactly one step before the `end` point. The `end` index is **exclusive**.
- **Example:** To get elements up to index 2, you must write `0:3`.

### Code:


In [None]:
print("First 5 votes:", votes[:5])
print("First 10 costs:", costs[:10])

# 2D array slicing (two_d_data from above)
print("First 3 rows of the 2D array:\n", two_d_data[:3, :])

First 5 votes: [775 787 918  88 166]
First 10 costs: ["'800.0'" "'800.0'" "'800.0'" "'300.0'" "'600.0'" "'600.0'" "'600.0'"
 "'700.0'" "'550.0'" "'500.0'"]
First 3 rows of the 2D array:
 [['775' "'800.0'"]
 ['787' "'800.0'"]
 ['918' "'800.0'"]]


- We can also slice columns. For example, `two_d_data[:, 0]` gives all votes, `two_d_data[:, 1]` gives all costs.
- Indexing & slicing help us focus on relevant subsets, like top-rated restaurants or certain cost ranges.


## <span style="color: skyblue;"> 2D Slicing </span>
Use the **R, C** mantra with slicing.
- `arr[0:2, 1:3]` fetches rows 0 to 1 and columns 1 to 2.
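
The `arr[0:2, 1:3]` example can be tried on a small made-up array. Here `arr` is built with `np.arange` and `reshape` purely for illustration.

```python
import numpy as np

arr = np.arange(12).reshape(3, 4)
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]

# Rows 0 to 1 and columns 1 to 2 (the end index is excluded in both).
block = arr[0:2, 1:3]

print(block)
# [[1 2]
#  [5 6]]
print(block.shape)   # (2, 2)
```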

# Question
In an $(m,n)$ array, what shape is returned by `arr[:,0]`?
# Choices
- [ ] `(1,n)`
- [x] `(m,)`
- [ ] `(m,1)`
- [ ] `(n,)`

 (Selecting a single column collapses to 1D.)
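
A quick check of this answer, using a made-up array with `m = 4` rows and `n = 3` columns. It also shows that slicing with a range keeps the result 2D.

```python
import numpy as np

arr = np.zeros((4, 3))   # m = 4 rows, n = 3 columns

print(arr[:, 0].shape)     # (4,)   single column index: collapses to 1D
print(arr[:, 0:1].shape)   # (4, 1) column slice: stays 2D
print(arr[0, :].shape)     # (3,)   single row index: collapses to 1D
```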

<span style="background-color: red;"> **[Note to Instructor]:** </span>
> Emphasize the "End + 1" rule for slicing so learners don't forget that the end index is excluded.
> Go through the <span style="background-color: Blue;">[colab notebook](https://colab.research.google.com/drive/1YfCn7106mc9PMgROdGeL-sGBfj4YfVRo?usp=sharing)</span> once.


<span style="background-color: red;"> **[Ask Learners]:** </span>
>  If the batch solves a high percentage of mandatory problems **(PSP)**, the instructor will share **Optional Industry Case Studies** (e.g., Bitcoin datasets, Microsoft data).