# Lab Practice: Built-in data structures and functions in Python

In this lab, you apply Python's built-in data structures and functions to data processing.

A popular technique when processing data is the implementation of different tasks through a pipeline. In data science, a **pipeline** consists of a set of tasks that change raw data into a more suitable format to conduct downstream analysis. The construction of pipelines simplifies the identification and implementation of the different operations required to analyze a dataset. 

Learning objectives:
- Practice creating custom functions
- Construct a data analysis pipeline

### Some helpful notes:
- This lab practice should take about two hours to complete. If a problem takes you too long to solve, don't hesitate to ask your instructor in class or post your questions in Canvas.
- Do not solve the problems by copying and pasting solutions from other sources. Doing so may help you finish this lab faster, but it won't contribute to your learning or improve your problem-solving skills.
- When browsing for examples about Python's utilities and features, try to understand the logic and purpose of the functionality you incorporate in your solutions.
- Make sure you review the examples covered previously in the tutorials before attempting this lab.

## Part I: Functions

### Problem 1: Distance meter

When working with data, combining or transforming existing attributes into new attributes is a common task. Consider the following record that contains (x,y) coordinates for two points:

Point 1: x = 10, y = 15; Point 2: x = 20, y = 25

Another way to represent the information contained in this record is to calculate the distance between the points. For this problem, write a function that calculates the Euclidean distance between the two points above.

Notes:

- Create a *tuple* to represent each point. Your solution should work regardless of the values in the coordinate variables.
- You may use Python's [math](https://docs.python.org/3.7/library/math.html) module
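As a warm-up for working with tuples and the `math` module, here is a minimal sketch that computes a different metric, the Manhattan distance (the sum of the absolute coordinate differences). The function name `manhattan_distance` and the sample points are illustrative assumptions and are not part of the lab.

```python
import math


def manhattan_distance(a, b):
    """Return the sum of absolute coordinate differences (not the Euclidean distance)."""
    # Tuple unpacking gives each coordinate a readable name.
    a_x, a_y = a
    b_x, b_y = b
    return math.fabs(b_x - a_x) + math.fabs(b_y - a_y)


start = (1, 2)  # illustrative points, unrelated to the lab record
end = (4, 6)
print(manhattan_distance(start, end))  # 7.0
```

Because the function receives the points as parameters, it works for any pair of coordinates, which is the same property your Euclidean version needs.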

In [150]:
def calc_distance(p_1, p_2):
    distance = None
    
    # YOUR SOLUTION
    
    return distance

# calc_distance(p_1, p_2)

### Problem 2: Pricing

Write a function that implements a pricing strategy that reduces the price of a purchase order by 10% only if the order exceeds $100. The price received by your function is before tax.

Your function should return four values:
1. Total with the discount applied (if the discount applies)
2. Discount (if any, otherwise the discount is 0)
3. Sales tax
4. Total with sales tax included

> As a good programming practice, assign every constant value in your code (e.g., tax rate, discount rate) to a variable and avoid repeating the literal value elsewhere. This way, each constant is defined in a single location in your program, which makes it easier to update in the future if necessary.
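The two ideas above, named constants and a function that returns several values, can be sketched on an unrelated example. The shipping rule, the constants `FREE_SHIPPING_THRESHOLD` and `SHIPPING_FEE`, and the function `shipping` are all hypothetical and only illustrate the pattern.

```python
FREE_SHIPPING_THRESHOLD = 50.00  # hypothetical value, for illustration only
SHIPPING_FEE = 4.99              # hypothetical value, for illustration only


def shipping(subtotal):
    """Return the shipping fee and the amount due for an order."""
    fee = 0.0 if subtotal >= FREE_SHIPPING_THRESHOLD else SHIPPING_FEE
    # Comma-separated return values are packed into a single tuple.
    return fee, subtotal + fee


# The caller unpacks the tuple into one variable per returned value.
fee, amount_due = shipping(20.00)
print(f"Fee: ${fee:.2f}, amount due: ${amount_due:.2f}")
```

Changing the fee later means editing one line at the top of the program rather than searching for every place where `4.99` appears.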

In [152]:
SALES_TAX_RATE = 0.065

def pricing(total):
    # YOUR SOLUTION
    return 0, 0, 0, 0

total, discount, tax, total_w_tax = pricing(100)
print(f"""
   The details of the purchase order are:
       Total: ${total:.2f}
       Discount: ${discount:.2f}
       Tax ({SALES_TAX_RATE * 100})%: ${tax:.2f}
       Total after tax: ${total_w_tax:.2f}
""")


   The details of the purchase order are:
       Total: $0.00
       Discount: $0.00
       Tax (6.5)%: $0.00
       Total after tax: $0.00



## Part II: Building a data analysis pipeline 

The string below contains an excerpt of sales numbers from a retail shop. The data comes from a transactional system used at the shop. Every time a customer checks out an order, the system records details of each item sold, such as the item's name, price, time, and date:

In [214]:
sales_records = """
Queen Microfiber Sheet Set,   $19.00, 07/20/2021 10:23
Rubbermaid 18pc Plastic Food, $9.99,  07/20/2021 10:45
Plastic Mixing Bowl Set of 3, $8.00,  08/20/2021 10:45
Pryce Silverware Set 20-pc.,  $10.00, 08/21/2021 11:00
Black+Decker 0.7 cu ft,       $59.99, 08/21/2021 11:20
Swing Top Wastebasket,        $44.44, 09/10/2021 11:20
Delux Bed Pillow,             $20.15, 09/10/2021 13:10
BELLA 2 Slice Toaster,        $5.10,  10/10/2021 13:10
Colander,                     $10.21, 11/10/2021 13:10
Tower Fan Oscillating,        $30.11, 11/10/2021 9:10
"""

sales_records

'\nQueen Microfiber Sheet Set,   $19.00, 07/20/2021 10:23\nRubbermaid 18pc Plastic Food, $9.99,  07/20/2021 10:45\nPlastic Mixing Bowl Set of 3, $8.00,  08/20/2021 10:45\nPryce Silverware Set 20-pc.,  $10.00, 08/21/2021 11:00\nBlack+Decker 0.7 cu ft,       $59.99, 08/21/2021 11:20\nSwing Top Wastebasket,        $44.44, 09/10/2021 11:20\nDelux Bed Pillow,             $20.15, 09/10/2021 13:10\nBELLA 2 Slice Toaster,        $5.10,  10/10/2021 13:10\nColander,                     $10.21, 11/10/2021 13:10\nTower Fan Oscillating,        $30.11, 11/10/2021 9:10\n'

The marketing department wants to conduct a sales analysis using these numbers. Since the original data source contains hundreds of thousands of records, working with a small subset makes it easier to write and test the programming code we need to conduct the sales analysis. 

The end goal is to calculate the total sales per month. We want to generate data that would enable the marketing team to create a report like this one:

<table>
    <tr>
        <th>Month</th>      <th>Total</th>
    </tr>
    <tr>
        <td>July</td>      <td>\$28.99</td>
    </tr> 
    <tr>    
        <td>August</td>    <td>\$77.99</td>
    </tr> 
    <tr>    
        <td>September</td> <td>\$64.59</td>
    </tr> 
    <tr>    
        <td>October</td>   <td>\$5.10</td>
    </tr> 
    <tr>    
        <td>November</td>  <td>\$40.32</td>
    </tr>
</table>

Of course, there are several ways to meet the end goal. We could create a single function that implements all the logic required to extract the sales numbers and aggregate them by month. The problem with a single function is that it becomes difficult to test and maintain as the logic grows.

In the following problems, you will implement a pipeline for analyzing the data, which in this case consists of generating monthly sales totals. 

### Problem 3: Data Representation

Write a function called *parse* that:
- Receives the raw data as a parameter
- Parses (transforms) it into a list of lists

Each nested list represents one sales record, and each of its items represents an attribute of that record: the item's name, price, and date.

Useful functions for this task:
- [split](https://www.programiz.com/python-programming/methods/string/split)

Notes:
- Keep your parse function focused on extracting the attributes of each record to construct a data structure, as shown in the example below.
- This data structure may contain empty values that you will filter out in the following problem.
- Use *list comprehension* when possible to write more concise code.
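To illustrate how `split` and a list comprehension combine, here is a small sketch on toy data. The string `raw` and its `;` separator are assumptions made for the example and differ from the lab's sales data.

```python
raw = "red;apple\ngreen;pear\n"  # toy data, not the lab's dataset

# Splitting on the newline character yields one string per line.
lines = raw.split("\n")
print(lines)  # ['red;apple', 'green;pear', '']

# A list comprehension applies a second split to every line.
rows = [line.split(";") for line in lines]
print(rows)  # [['red', 'apple'], ['green', 'pear'], ['']]
```

Note the `['']` entry produced by the trailing newline: splitting an empty string still returns a list with one empty item, which is why a later cleaning step is needed.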

In [161]:
def parse(data):
    # YOUR SOLUTION
    
    return None

sales_1 = parse(sales_records)
sales_1

In [162]:
# Example of the expected output:

print("""
 [[''],
 ['Queen Microfiber Sheet Set', '   $19.00', ' 07/20/2021 10:23'], ...
""")


 [[''],
 ['Queen Microfiber Sheet Set', '   $19.00', ' 07/20/2021 10:23'], ...



### Problem 4: Data cleaning - removing missing values

Write a function called *remove_empty* that:
- Receives the data processed by the *parse* function as a parameter
- Removes all empty elements (if any)
- Removes all leading and trailing whitespace from each element

Useful functions for this task:
- [strip](https://www.programiz.com/python-programming/methods/string/strip)

Notes:
- Use *list comprehension* when possible to write more concise code.
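The sketch below shows, on toy data, how `strip` behaves and how a list comprehension with a condition can filter items. The list `rows` and the variable names are illustrative assumptions, and using `any` to detect empty rows is just one possible test.

```python
rows = [[""], ["  red ", " apple"], ["green", "pear  "]]  # toy data

# strip() with no arguments removes whitespace from both ends of a string.
print("  red ".strip() == "red")  # True

# An `if` clause in a comprehension keeps only the items that pass the test.
# any(row) is False when every string in the row is empty.
kept = [row for row in rows if any(row)]

# Nested comprehensions clean every value inside every remaining row.
cleaned = [[value.strip() for value in row] for row in kept]
print(cleaned)  # [['red', 'apple'], ['green', 'pear']]
```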

In [163]:
def remove_empty(data):
    # YOUR SOLUTION    
    
    return None

sales_2 = remove_empty(sales_1)
sales_2

In [164]:
# Example of the expected output:

print("""
 [['Queen Microfiber Sheet Set', '$19.00', '07/20/2021 10:23'],
 ['Rubbermaid 18pc Plastic Food', '$9.99', '07/20/2021 10:45'], ...
""")


 [['Queen Microfiber Sheet Set', '$19.00', '07/20/2021 10:23'],
 ['Rubbermaid 18pc Plastic Food', '$9.99', '07/20/2021 10:45'], ...



### Problem 5: Data exploration

In different cells, perform some exploratory operations on the data, such as displaying:
- The number of records in the dataset
- The first 5 records
- The last 5 records
- 5 records in alternate order, i.e., records occupying the even positions
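List slicing covers most of these operations. The following sketch uses a toy list named `letters` (an assumption for the example) to show the `start:stop:step` notation.

```python
letters = ["a", "b", "c", "d", "e", "f", "g", "h"]  # toy data

print(len(letters))   # 8
print(letters[:3])    # ['a', 'b', 'c']            first three items
print(letters[-3:])   # ['f', 'g', 'h']            last three items
print(letters[::2])   # ['a', 'c', 'e', 'g']       every second item from index 0
print(letters[1::2])  # ['b', 'd', 'f', 'h']       every second item from index 1
```

Python indexes start at 0, so decide whether "even positions" means even indexes (0, 2, 4, ...) or the second, fourth, sixth, ... records, and pick the starting index accordingly.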

In [200]:
# YOUR SOLUTION

In [201]:
# YOUR SOLUTION

In [202]:
# YOUR SOLUTION

In [203]:
# SOLUTION

len(sales_2)

10

### Problem 6: Data cleaning - formatting

Write a function called *format_values* that:
- Receives the data processed by the *remove_empty* function as a parameter
- Converts sales numbers to the *float* type
- Converts sales dates to the *datetime* type

Useful functions for this task:
- [replace](https://www.programiz.com/python-programming/methods/string/replace)

Notes:
- Use *list comprehension* when possible to write more concise code.
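As a hedged illustration of both conversions, the sketch below cleans a toy amount with `replace` and parses a toy timestamp with `datetime.strptime` from the standard library. The sample strings are assumptions, and `strptime` is one possible approach that the lab text does not prescribe.

```python
from datetime import datetime

amount_text = "EUR 1,250.50"  # toy value, not from the lab's dataset
# Remove every character that float() cannot parse, then convert.
amount = float(amount_text.replace("EUR ", "").replace(",", ""))
print(amount)  # 1250.5

stamp_text = "01/02/2020 9:05"  # toy value in month/day/year hour:minute form
# The format string describes the layout of the text being parsed.
stamp = datetime.strptime(stamp_text, "%m/%d/%Y %H:%M")
print(stamp)  # 2020-01-02 09:05:00
```

The `%H` directive accepts a single-digit hour when parsing, so a time such as `9:05` does not need to be zero-padded first.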

In [207]:
def format_values(data):
    # YOUR SOLUTION

    return None

sales_3 = format_values(sales_2)
sales_3

In [208]:
# Example of the expected output:

print("""
[['Queen Microfiber Sheet Set', 19.0, datetime.datetime(2021, 7, 20, 10, 23)],
 ['Rubbermaid 18pc Plastic Food', 9.99, datetime.datetime(2021, 7, 20, 10, 45)], ...
""")


[['Queen Microfiber Sheet Set', 19.0, datetime.datetime(2021, 7, 20, 10, 23)],
 ['Rubbermaid 18pc Plastic Food', 9.99, datetime.datetime(2021, 7, 20, 10, 45)], ...



### Problem 7: Data exploration (second round)

Perform a second round of data exploration by sorting the records by price in descending order (high to low).

Notes:
- You may find it helpful to learn more about the parameters of the [sort](https://www.programiz.com/python-programming/methods/list/sort) function covered in the tutorial.
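The `key` and `reverse` parameters are the relevant ones here. The sketch below applies them to a toy list of names and scores (assumed data, unrelated to the sales records).

```python
scores = [["Ana", 82], ["Cy", 67], ["Ben", 95]]  # toy data

# key picks the value to compare; reverse=True sorts from high to low.
by_score = sorted(scores, key=lambda row: row[1], reverse=True)
print(by_score)  # [['Ben', 95], ['Ana', 82], ['Cy', 67]]

# list.sort() takes the same parameters but sorts in place and returns None.
scores.sort(key=lambda row: row[0])
print(scores)  # [['Ana', 82], ['Ben', 95], ['Cy', 67]]
```

Use `sorted` when you want to keep the original list unchanged, and `list.sort` when modifying the list in place is acceptable.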

In [211]:
# YOUR SOLUTION

In [212]:
# Example of the expected output:

print("""
[['Black+Decker 0.7 cu ft', 59.99, datetime.datetime(2021, 8, 21, 11, 20)],
 ['Swing Top Wastebasket', 44.44, datetime.datetime(2021, 9, 10, 11, 20)],
 ...
 ['BELLA 2 Slice Toaster', 5.1, datetime.datetime(2021, 10, 10, 13, 10)]]
""")


[['Black+Decker 0.7 cu ft', 59.99, datetime.datetime(2021, 8, 21, 11, 20)],
 ['Swing Top Wastebasket', 44.44, datetime.datetime(2021, 9, 10, 11, 20)],
 ...
 ['BELLA 2 Slice Toaster', 5.1, datetime.datetime(2021, 10, 10, 13, 10)]]



### Problem 8: Data transformation - Extracting date components

Write a function called *extract_date* that:
- Receives the data processed by the *format_values* function as a parameter
- Extracts the month and day from the datetime objects and appends them to each record as two-digit strings

Notes:
- Use *list comprehension* when possible to write more concise code.
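A datetime object exposes its components both as integer attributes and, through `strftime`, as formatted strings. The sketch below uses a toy datetime and extracts the year so that it illustrates the mechanism with a different component; the names `moment`, `row`, and `extended` are assumptions.

```python
from datetime import datetime

moment = datetime(2020, 1, 2, 9, 5)  # toy value

# Attributes return integers, so leading zeros are lost.
print(moment.month, moment.day)  # 1 2

# strftime returns zero-padded strings according to the format code.
print(moment.strftime("%m"), moment.strftime("%d"))  # 01 02

# Concatenating lists builds an extended copy of a record.
row = ["sample", moment]
extended = row + [moment.strftime("%Y")]
print(extended[-1])  # 2020
```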

In [167]:
def extract_date(data):
    # YOUR SOLUTION

    return None

sales_4 = extract_date(sales_3)
sales_4    

In [179]:
# Example of the expected output:

print("""
 [['Queen Microfiber Sheet Set', 19.0, datetime.datetime(2021, 7, 20, 10, 23), '07', '20'],
  ['Rubbermaid 18pc Plastic Food', 9.99, datetime.datetime(2021, 7, 20, 10, 45), '07', '20'], ...
""")


 [['Queen Microfiber Sheet Set', 19.0, datetime.datetime(2021, 7, 20, 10, 23), '07', '20'],
  ['Rubbermaid 18pc Plastic Food', 9.99, datetime.datetime(2021, 7, 20, 10, 45), '07', '20'], ...



### Problem 9: Create a pipeline

In this problem, you will implement your own pipeline to group all the processing tasks implemented so far.

Create a function called *transform* that receives the raw data and a list of functions, each representing a task in the pipeline. Your function should apply all the functions in the pipeline, in order, and return the processed dataset.
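In Python, functions are objects that can be stored in a list and called later. The sketch below chains three toy text-processing functions, where the output of each step becomes the input of the next. The functions `to_words`, `drop_short`, and `to_upper` are hypothetical and unrelated to the lab's tasks.

```python
def to_words(text):
    return text.split()


def drop_short(words):
    return [word for word in words if len(word) > 2]


def to_upper(words):
    return [word.upper() for word in words]


# Functions are stored without parentheses, so they are not called yet.
steps = [to_words, drop_short, to_upper]

result = "a tiny toy example of chaining"
for step in steps:
    # Each step receives the result produced by the previous one.
    result = step(result)

print(result)  # ['TINY', 'TOY', 'EXAMPLE', 'CHAINING']
```

Adding, removing, or reordering a step only requires editing the list, which is the maintainability benefit a pipeline offers over one large function.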

In [None]:
pipeline = [parse, remove_empty, format_values, extract_date]

def transform(data, tasks):
    # YOUR SOLUTION
            
    return None

sales = transform(sales_records, pipeline)
sales

### Problem 10: Data aggregation

To finalize the data analysis, write a function called *aggregate* that:
- Receives the data processed by the pipeline as a parameter
- Aggregates the sales numbers per month in a dictionary object

Notes:
- Use *dict comprehension* when possible to write more concise code.

In [175]:
def aggregate(data):
    # YOUR SOLUTION
    
    return None

sales_totals = aggregate(sales)
sales_totals

In [176]:
# Expected output:

print("""
{'07': 28.990000000000002,
 '08': 77.99000000000001,
 '09': 64.59,
 '10': 5.1,
 '11': 40.32}
""")


{'07': 28.990000000000002,
 '08': 77.99000000000001,
 '09': 64.59,
 '10': 5.1,
 '11': 40.32}



Notes:
- If you find this problem difficult to solve, try the "divide and conquer" strategy.
- You may first generate an intermediate dictionary that contains the list of sales per month.
- You may find the [reduce](https://www.geeksforgeeks.org/reduce-in-python/) function from the functools module helpful.
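The two-stage idea in the notes can be sketched on toy data: first group values by key, then reduce each group to a single number. The list `readings` and the names `grouped` and `totals` are assumptions for the example, and `dict.setdefault` is only one of several ways to build the intermediate dictionary.

```python
from functools import reduce

readings = [["north", 1.5], ["south", 2.0], ["north", 2.25], ["south", 0.5]]  # toy data

# Stage 1: collect the values that share a key into one list per key.
grouped = {}
for region, value in readings:
    grouped.setdefault(region, []).append(value)
print(grouped)  # {'north': [1.5, 2.25], 'south': [2.0, 0.5]}

# Stage 2: a dict comprehension reduces each list to its total.
totals = {
    region: reduce(lambda acc, value: acc + value, values)
    for region, values in grouped.items()
}
print(totals)  # {'north': 3.75, 'south': 2.5}
```

Totals such as `28.990000000000002` in the expected output are a normal effect of binary floating-point arithmetic; rounding or formatting with `:.2f` when displaying the values hides the extra digits.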