# Assignment: Data Analysis with Python

In this assignment, you will apply Python's built-in data structures and functions to data analysis.

A common practice in data processing is to organize the processing tasks as a pipeline. In data science, a **pipeline** is a sequence of tasks that transforms raw data into a format better suited to downstream analysis. Building a pipeline makes it easier to identify and implement the individual operations required to analyze a dataset.

Skills:
- Practice creating custom functions
- Construct a data analysis pipeline

### Some helpful notes:
- When browsing for examples about Python's utilities and features, try to understand the logic and purpose of the functionality you incorporate in your solutions.
- Make sure you review the examples covered previously in the tutorials before attempting this lab.

## Building a data analysis pipeline

The string below contains an excerpt of sales numbers from a retail shop. The data comes from a transactional system used at the shop. Every time a customer checks out an order, the system records details of each item sold, such as the item's name, price, time, and date:

In [35]:
sales_records = """
Queen Microfiber Sheet Set,   $19.00, 07/20/2021 10:23
Rubbermaid 18pc Plastic Food, $9.99,  07/20/2021 10:45
Plastic Mixing Bowl Set of 3, $8.00,  08/20/2021 10:45
Pryce Silverware Set 20-pc.,  $10.00, 08/21/2021 11:00
Black+Decker 0.7 cu ft,       $59.99, 08/21/2021 11:20
Swing Top Wastebasket,        $44.44, 09/10/2021 11:20
Delux Bed Pillow,             $20.15, 09/10/2021 13:10
BELLA 2 Slice Toaster,        $5.10,  10/10/2021 13:10
Colander,                     $10.21, 11/10/2021 13:10
Tower Fan Oscillating,        $30.11, 11/10/2021 9:10
"""

sales_records

'\nQueen Microfiber Sheet Set,   $19.00, 07/20/2021 10:23\nRubbermaid 18pc Plastic Food, $9.99,  07/20/2021 10:45\nPlastic Mixing Bowl Set of 3, $8.00,  08/20/2021 10:45\nPryce Silverware Set 20-pc.,  $10.00, 08/21/2021 11:00\nBlack+Decker 0.7 cu ft,       $59.99, 08/21/2021 11:20\nSwing Top Wastebasket,        $44.44, 09/10/2021 11:20\nDelux Bed Pillow,             $20.15, 09/10/2021 13:10\nBELLA 2 Slice Toaster,        $5.10,  10/10/2021 13:10\nColander,                     $10.21, 11/10/2021 13:10\nTower Fan Oscillating,        $30.11, 11/10/2021 9:10\n'

The marketing department wants to conduct a sales analysis using these numbers. Since the original data source contains hundreds of thousands of records, working with a small subset makes it easier to write and test the programming code we need to conduct the sales analysis. 

The end goal is to calculate the total sales per month. We want to generate data that would enable the marketing team to create a report like this one:

<table>
    <tr>
        <th>Month</th>      <th>Total</th>
    </tr>
    <tr>
        <td>July</td>      <td>\$28.99</td>
    </tr> 
    <tr>    
        <td>August</td>    <td>\$77.99</td>
    </tr> 
    <tr>    
        <td>September</td> <td>\$64.59</td>
    </tr> 
    <tr>    
        <td>October</td>   <td>\$5.10</td>
    </tr> 
    <tr>    
        <td>November</td>  <td>\$40.32</td>
    </tr>
</table>

Of course, there are several ways to generate this data. We could create a single function that implements all the logic required to extract the sales numbers and aggregate them by month. The main issue with this approach is that the individual operations involved in transforming the raw data into monthly aggregates become difficult to maintain or change later.

In the following problems, you will implement a pipeline for analyzing the data, which in this case consists of generating monthly sales totals. 

> In each problem, use *list comprehension* when possible to make your code more concise.

### Problem 1: Data Representation

Write a function called *parse* that:
- Receives the raw data as a parameter
- Parses (transforms) it into a list of lists
- Represents each sales record as one nested list whose items are the record's attributes: the item's name, price, and date

Tips:
- Use the [split](https://www.programiz.com/python-programming/methods/string/split) function to break the text into a list of strings.
- When using split, you need to specify the separator at which the text is *split*. Use "\n" to separate the records and "," to separate the attributes within each record.
- Keep the focus of your parse function to only extracting the different attributes in each record to construct a data structure, as shown in the example below.
- This data structure may contain empty values that you will filter out in the following problem.
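
As a minimal sketch of the two splitting steps mentioned in the tips, the snippet below works on a small made-up string. The name `sample` and its contents are illustrative assumptions, not the assignment data, and this is not meant as the full solution.

```python
# Hypothetical two-record sample; not the assignment data.
sample = "\napple, $1.00\nbanana, $0.50\n"

# Step 1: split on the newline character to get one string per line.
lines = sample.split("\n")
print(lines)   # ['', 'apple, $1.00', 'banana, $0.50', '']

# Step 2: split each line on the comma to separate its fields.
fields = [line.split(",") for line in lines]
print(fields)  # [[''], ['apple', ' $1.00'], ['banana', ' $0.50'], ['']]
```

Note that the leading and trailing newlines produce empty entries, and the fields keep their surrounding spaces. Both are left untouched at this stage.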

In [36]:
def parse(data):
    # YOUR SOLUTION
    
    return [[]]

sales_1 = parse(sales_records)
sales_1

[[]]

In [37]:
# Example of the expected output:

print("""
 [[''],
 ['Queen Microfiber Sheet Set', '   $19.00', ' 07/20/2021 10:23'],
 ['Rubbermaid 18pc Plastic Food', ' $9.99', '  07/20/2021 10:45'],
 ...]""")


 [[''],
 ['Queen Microfiber Sheet Set', '   $19.00', ' 07/20/2021 10:23'],
 ['Rubbermaid 18pc Plastic Food', ' $9.99', '  07/20/2021 10:45'],
 ...]


### Problem 2: Data cleaning - removing missing values

Write a function called *remove_empty* that:
- Receives the data processed by the *parse* function as a parameter
- Removes all empty elements (if any)
- Removes all leading and trailing whitespace from each element

Tips:
- Use the [strip](https://www.programiz.com/python-programming/methods/string/strip) function to remove leading and trailing whitespace
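
The following sketch shows how `strip` behaves and one possible way to drop empty rows with a list comprehension. The `rows` list is a hypothetical stand-in for parsed data, and the filtering condition is only one of several reasonable choices.

```python
# Hypothetical parsed rows, including the empty rows that splitting can leave behind.
rows = [[''], ['apple', '  $1.00'], ['banana', ' $0.50 '], ['']]

# str.strip() with no arguments removes whitespace from both ends of a string.
print('  $1.00 '.strip())  # $1.00

# Keep only rows with at least one non-empty field, stripping every field kept.
cleaned = [
    [field.strip() for field in row]
    for row in rows
    if any(field.strip() for field in row)
]
print(cleaned)  # [['apple', '$1.00'], ['banana', '$0.50']]
```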

In [38]:
def remove_empty(data):
    # YOUR SOLUTION    
    
    return None

sales_2 = remove_empty(sales_1)
sales_2

In [39]:
# Example of the expected output:

print("""
 [['Queen Microfiber Sheet Set', '$19.00', '07/20/2021 10:23'],
 ['Rubbermaid 18pc Plastic Food', '$9.99', '07/20/2021 10:45'], ...
""")


 [['Queen Microfiber Sheet Set', '$19.00', '07/20/2021 10:23'],
 ['Rubbermaid 18pc Plastic Food', '$9.99', '07/20/2021 10:45'], ...



### Problem 3: Data exploration

In different cells, perform some exploratory operations on the data, such as displaying:
1. The number of records in the dataset
2. The first five records
3. The last five records
4. Five records in alternate order, i.e., records occupying the even positions in the dataset
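
The operations listed above map naturally onto `len` and list slicing. Here is a small sketch on a made-up list of ten letters (the `letters` list is an assumption used only for illustration):

```python
# Hypothetical ten-item list standing in for a dataset.
letters = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

print(len(letters))    # 10
print(letters[:5])     # ['a', 'b', 'c', 'd', 'e']  (first five)
print(letters[-5:])    # ['f', 'g', 'h', 'i', 'j']  (last five)

# A slice step of 2 takes every other item.
print(letters[::2])    # ['a', 'c', 'e', 'g', 'i']  (indices 0, 2, 4, ...)
print(letters[1::2])   # ['b', 'd', 'f', 'h', 'j']  (indices 1, 3, 5, ...)
```

Whether "even positions" means the even zero-based indices or the second, fourth, sixth items is a matter of convention. The last two lines show both readings.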

In [40]:
# 1. The number of records in the dataset

# YOUR SOLUTION

In [41]:
# 2. The first five records, i.e., the head of the dataset

# YOUR SOLUTION

In [42]:
# 3. The last five records, i.e., the tail of the dataset

# YOUR SOLUTION

In [43]:
# 4. Five records in alternate order, i.e., records occupying the even positions in the dataset

# YOUR SOLUTION

### Problem 4: Data cleaning - formatting

Write a function called *format_values* that:
- Receives the data processed by the *remove_empty* function as a parameter
- Converts prices to the *float* type
- Converts dates to the *datetime* type

Tips:
- Use the [replace](https://www.programiz.com/python-programming/methods/string/replace) function to get rid of the $ sign in the price data by *replacing* it with an empty string.
- Use the *float()* function to transform the current string prices into float values.
- Review your solution for the previous Lab Practice to process dates.
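
As a hedged illustration of the conversions named in the tips, the sketch below converts one made-up price string and one made-up date string. The variable names and values are assumptions. The format string `"%m/%d/%Y %H:%M"` is chosen to match the month/day/year layout of the sample records, and `datetime.strptime` is one standard-library way to parse dates that may differ from the approach used in the earlier lab.

```python
from datetime import datetime

# Hypothetical values in the same shape as the cleaned records.
price_text = "$12.50"
date_text = "01/05/2021 9:30"

# Remove the dollar sign, then convert the remaining text to a float.
price = float(price_text.replace("$", ""))
print(price)  # 12.5

# Parse month/day/year hour:minute into a datetime object.
when = datetime.strptime(date_text, "%m/%d/%Y %H:%M")
print(when)   # 2021-01-05 09:30:00
```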

In [44]:
def format_values(data):
    # YOUR SOLUTION

    return None

sales_3 = format_values(sales_2)
sales_3

In [45]:
# Example of the expected output:

print("""
[['Queen Microfiber Sheet Set', 19.0, datetime.datetime(2021, 7, 20, 10, 23)],
 ['Rubbermaid 18pc Plastic Food', 9.99, datetime.datetime(2021, 7, 20, 10, 45)], ...
""")


[['Queen Microfiber Sheet Set', 19.0, datetime.datetime(2021, 7, 20, 10, 23)],
 ['Rubbermaid 18pc Plastic Food', 9.99, datetime.datetime(2021, 7, 20, 10, 45)], ...



### Problem 5: Data exploration (second round)

Perform a second round of data exploration by sorting the records by price in descending order (high to low).

Tips:
- Learn more about the parameters of the [sort](https://www.programiz.com/python-programming/methods/list/sort) function covered in the tutorial
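
The `key` and `reverse` parameters control what is compared and in which direction. A minimal sketch on a made-up list (the `rows` data is an assumption for illustration):

```python
# Hypothetical [name, price] rows.
rows = [['pen', 1.5], ['bag', 30.25], ['book', 12.0]]

# sorted() returns a new list; key selects the price, reverse=True puts the highest first.
by_price_desc = sorted(rows, key=lambda row: row[1], reverse=True)
print(by_price_desc)  # [['bag', 30.25], ['book', 12.0], ['pen', 1.5]]

# list.sort() sorts in place (and returns None); the default order is ascending.
rows.sort(key=lambda row: row[1])
print(rows)           # [['pen', 1.5], ['book', 12.0], ['bag', 30.25]]
```

The expected output for this problem lists the highest price first, which corresponds to the `reverse=True` variant.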

In [46]:
# YOUR SOLUTION

In [47]:
# Example of the expected output:

print("""
[['Black+Decker 0.7 cu ft', 59.99, datetime.datetime(2021, 8, 21, 11, 20)],
 ['Swing Top Wastebasket', 44.44, datetime.datetime(2021, 9, 10, 11, 20)],
 ...
 ['BELLA 2 Slice Toaster', 5.1, datetime.datetime(2021, 10, 10, 13, 10)]]
""")


[['Black+Decker 0.7 cu ft', 59.99, datetime.datetime(2021, 8, 21, 11, 20)],
 ['Swing Top Wastebasket', 44.44, datetime.datetime(2021, 9, 10, 11, 20)],
 ...
 ['BELLA 2 Slice Toaster', 5.1, datetime.datetime(2021, 10, 10, 13, 10)]]



### Problem 6: Data transformation - Extracting date components

Write a function called *extract_date* that:
- Receives the data processed by the *format_values* function as a parameter
- Extracts the month and day from the datetime objects by creating two additional items in each record
- Take a look at the example of the expected output shown below to understand how to add the month and day
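
The expected output shows the two extra items as zero-padded strings such as `'07'` and `'20'`. One way to obtain strings in that form is `strftime`. The sketch below uses a made-up record, and both the names and the choice of `strftime` are assumptions rather than the required approach.

```python
from datetime import datetime

# Hypothetical record in the same shape as the formatted data.
stamp = datetime(2021, 3, 4, 14, 5)
record = ['notebook', 2.5, stamp]

# %m and %d give the zero-padded month and day as strings.
month = stamp.strftime("%m")
day = stamp.strftime("%d")
print(month, day)  # 03 04

# Build a new record with the two extra items appended.
extended = record + [month, day]
print(extended[3:])  # ['03', '04']
```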

In [48]:
def extract_date(data):
    # YOUR SOLUTION

    return None

sales_4 = extract_date(sales_3)
sales_4    

In [49]:
# Example of the expected output:

print("""
 [['Queen Microfiber Sheet Set', 19.0, datetime.datetime(2021, 7, 20, 10, 23), '07', '20'],
  ['Rubbermaid 18pc Plastic Food', 9.99, datetime.datetime(2021, 7, 20, 10, 45), '07', '20'], ...
""")


 [['Queen Microfiber Sheet Set', 19.0, datetime.datetime(2021, 7, 20, 10, 23), '07', '20'],
  ['Rubbermaid 18pc Plastic Food', 9.99, datetime.datetime(2021, 7, 20, 10, 45), '07', '20'], ...



### Problem 7: Create a pipeline

In this problem, you will implement your own pipeline to group all the processing tasks implemented so far.

Create a function called *transform* that receives the raw data and a list of tasks, where each task is a function representing one step in the pipeline. Your function should apply the tasks in order, passing the output of each task as the input of the next, and return the processed dataset.
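
Because Python functions are objects, they can be stored in a list and called one after another. The sketch below shows the idea with three made-up text-processing steps. The functions `to_words`, `drop_short`, and `to_upper` are hypothetical and unrelated to the assignment's tasks.

```python
# Three hypothetical steps; each takes the previous step's output as its input.
def to_words(text):
    return text.split()

def drop_short(words):
    return [word for word in words if len(word) > 2]

def to_upper(words):
    return [word.upper() for word in words]

steps = [to_words, drop_short, to_upper]

# Feed the result of each step into the next one.
result = "a small toy pipeline"
for step in steps:
    result = step(result)

print(result)  # ['SMALL', 'TOY', 'PIPELINE']
```

Keeping the steps in a list means a step can be added, removed, or reordered without touching the loop that runs them.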

In [50]:
pipeline = [parse, remove_empty, format_values, extract_date]

def transform(data, tasks):
    # YOUR SOLUTION
            
    return None

sales = transform(sales_records, pipeline)
sales

### Problem 8: Data aggregation

To finalize the data analysis, write a function called *aggregate* that:
- Receives the data processed by the pipeline as a parameter
- Aggregates the sales numbers per month in a dictionary object

Tips:
- If you find this problem difficult to solve, try the "divide and conquer" strategy:
    1. First, generate an intermediate dictionary that contains the list of sales per month. In this dictionary, the keys correspond to month values.
    2. Then, iterate over the dictionary to sum up the values in each month.
- A concise solution may make use of the [setdefault](https://www.w3schools.com/python/ref_dictionary_setdefault.asp) method of Python dictionaries and dict comprehension techniques. 
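
The two-step strategy from the tips can be sketched on made-up data as follows. The `pairs` list and its integer values are assumptions chosen to keep the arithmetic exact. They are not the sales records.

```python
# Hypothetical (key, value) pairs to be grouped by key.
pairs = [('a', 1), ('b', 2), ('a', 3), ('b', 4), ('c', 5)]

# Step 1: collect the values for each key.
# setdefault returns the existing list for a key, or inserts and returns a new empty one.
groups = {}
for key, value in pairs:
    groups.setdefault(key, []).append(value)
print(groups)  # {'a': [1, 3], 'b': [2, 4], 'c': [5]}

# Step 2: sum each list with a dict comprehension.
totals = {key: sum(values) for key, values in groups.items()}
print(totals)  # {'a': 4, 'b': 6, 'c': 5}
```

With floating-point prices, sums can carry small rounding artifacts, which is why the expected output for this problem contains values such as `28.990000000000002`.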

In [53]:
def aggregate(data):
    # YOUR SOLUTION
    
    return None

sales_totals = aggregate(sales)
sales_totals

In [54]:
# Expected output:

print("""
{'07': 28.990000000000002,
 '08': 77.99000000000001,
 '09': 64.59,
 '10': 5.1,
 '11': 40.32}
""")


{'07': 28.990000000000002,
 '08': 77.99000000000001,
 '09': 64.59,
 '10': 5.1,
 '11': 40.32}

