# Data Science with Python Course Exam: Intermediate Level

In this activity we work with the receipt data of a papercraft company called "DM". We will integrate this data with the company's product dataset to extract valuable information.

In [None]:
%pip install -r ../requirements.txt

import pandas as pd
import numpy as np

## Data cleansing

### 1. Load dataset


In [None]:
# 1. Load the dataset
input_file = '../data/detalle_boletas.csv'

detalle_boletas = pd.read_csv(input_file, sep=',', encoding='utf-8')
# Display the first 5 rows of the dataset to check if it was loaded correctly
print(detalle_boletas.head())

print(detalle_boletas.dtypes)

### 2. Modify dataset

Before beginning the analysis, it is necessary to modify the dataset ```detalle_boletas```. In particular:

a. Delete the column ```Precio_prod``` because the prices are incorrect.

b. Create a column ```Pais_Venta```, because the company intends to enter the international market in the near future. For now, all the values of this column should be "Chile".

c. Change column name ```NXXX``` to ```Num Boleta``` to make it more descriptive.

Notice, from the arrangement of the data, that when a receipt has more than one product, several rows refer to that receipt.

![alt text](../images/receipts.png)

In this case:

- The receipt 554170000002 has two products, with ```ID``` 400009 and ```ID``` 400007. 3 and 2 units of each product were sold, respectively. It was issued on January 1, 2016.
- The receipt 554170000003 has one product, with ```ID``` 400005. 2 units of this product were sold. It was issued on January 2, 2016.
- The receipt 554170000004 has three products, with ```ID``` 400005, ```ID``` 400001, and ```ID``` 400002. 2 units of each product were sold. It was issued on January 2, 2016.


Some important considerations:

- One or more receipts can be issued per day.
- A receipt never has two different issuance dates.
- There are never two rows with the same receipt and the same product.
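
The considerations above can be checked directly against the data. The following is a minimal sketch on toy rows that mirror the example receipts; the column names assume the rename from step c has been applied and that ```Fecha``` is already clean, and the variable names (```sample```, ```dates_per_receipt```, ```duplicated_pairs```) are illustrative only.

```python
import pandas as pd

# Toy rows mirroring the example receipts above (not the real dataset).
sample = pd.DataFrame({
    "Num Boleta": ["554170000002", "554170000002", "554170000003",
                   "554170000004", "554170000004", "554170000004"],
    "ID": ["400009", "400007", "400005", "400005", "400001", "400002"],
    "Cantidad": [3, 2, 2, 2, 2, 2],
    "Fecha": ["2016/01/01", "2016/01/01", "2016/01/02",
              "2016/01/02", "2016/01/02", "2016/01/02"],
})

# Each receipt should carry exactly one issuance date.
dates_per_receipt = sample.groupby("Num Boleta")["Fecha"].nunique()
receipts_with_many_dates = dates_per_receipt[dates_per_receipt > 1]

# Each (receipt, product) pair should appear in at most one row.
duplicated_pairs = sample[sample.duplicated(subset=["Num Boleta", "ID"], keep=False)]

print("Receipts with more than one date:", len(receipts_with_many_dates))
print("Rows with a repeated receipt/product pair:", len(duplicated_pairs))
```

If either count is greater than zero on the real data, the rows involved are worth inspecting before continuing.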


In [None]:
detalle_boletas = detalle_boletas.drop(columns=['Precio_prod'])
print(detalle_boletas.head())

detalle_boletas['Pais_Venta'] = 'Chile'
print(detalle_boletas.head())

detalle_boletas = detalle_boletas.rename(columns={'NXXX': 'Num Boleta'})
print(detalle_boletas.head())

### 3. Cleansing


The file ```detalle_boletas.csv``` loaded above contains dirty data and must be cleaned:

a. Some rows have the ID "4XXXXXX" or the Num Boleta "55417XXXXXXX". Delete any row in the dataset that contains either of these values, because the system did not generate them correctly and they should not be considered in the analysis.

b. The column Fecha has extra characters. Clean the column in order to obtain the format YYYY/MM/DD (without extra characters). Particularly, identify which extra characters exist in the column besides "/" or numbers and remove them.
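
One possible way to identify the extra characters, rather than spotting them by eye, is to strip everything that is allowed and look at what remains. This is only a sketch: the dirty values below are invented for illustration, and the real column may contain different characters.

```python
import pandas as pd

# Hypothetical dirty dates; the real column may contain other characters.
fechas = pd.Series(["2016/01/01", "2016/0{1/02", "2016/01/0.3", "20_16/01/04!"])

# Remove every digit and "/" so that only the unexpected characters remain.
leftovers = fechas.str.replace(r"[0-9/]", "", regex=True)
extra_chars = sorted(set("".join(leftovers)))
print(extra_chars)

# Keep only digits and "/" in the cleaned values.
cleaned = fechas.str.replace(r"[^0-9/]", "", regex=True)
print(cleaned.tolist())
```

Removing everything that is not a digit or "/" (the negated character class) avoids having to list each unwanted character by hand.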

In [None]:
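# Flag rows whose ID or receipt number is a placeholder.
# The mask is not named `filter` to avoid shadowing the Python builtin;
# regex=False makes pandas treat the patterns as literal text.
invalid_rows = (
    detalle_boletas['ID'].str.contains('4XXXXX', regex=False)
    | detalle_boletas['Num Boleta'].str.contains('55417XXXXXXX', regex=False)
)
print("Rows to be removed based on filter:")
print(detalle_boletas[invalid_rows])


# Keep only the valid rows
detalle_boletas = detalle_boletas[~invalid_rows].copy()
print(detalle_boletas.head())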


# Identified extra characters in the 'Fecha' column
# The characters to be removed are: '{', '.', '_', '-', '!'
# Replaced with empty string and recognized with regex
detalle_boletas['Fecha'] = detalle_boletas['Fecha'].str.replace(r'[{._\-!]', '', regex=True)
print(detalle_boletas.head())


## Data extraction

### 4. Descriptive statistics Cantidad

Calculate the descriptive statistics of the column Cantidad for each existing product and print them to the console. The descriptive statistics should include the mean, standard deviation, minimum, and maximum. See the image below.


![Example descriptive statistic](../images/descriptive_statistic_example.png)
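
As a point of comparison, per-product statistics can also be computed with ```groupby``` and a list of aggregation names. The sketch below uses a small invented sample, not the exam data, and the name ```stats``` is illustrative.

```python
import pandas as pd

# Invented sample: two products with two sales each.
sample = pd.DataFrame({
    "ID": ["400005", "400005", "400001", "400001"],
    "Cantidad": [2, 4, 1, 3],
})

# One row per product, one column per statistic, in the order listed.
stats = sample.groupby("ID")["Cantidad"].agg(["mean", "std", "min", "max"])
print(stats)
```

Note that ```std``` here is the sample standard deviation (pandas uses ```ddof=1``` by default).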

In [None]:
# A list of aggregation names keeps a stable column order (a set has no guaranteed order)
pivot_table_boletas = detalle_boletas.pivot_table(index=['ID'], values=['Cantidad'], aggfunc=['mean', 'std', 'min', 'max'])
print(pivot_table_boletas.head(10))

### 5. Data extraction: Separate Fecha column

Now that the data in ```detalle_boletas``` is clean, generate three columns from ```Fecha```: ```Anho``` (the year), ```Mes``` (the month), and ```Dia``` (the day). Add these columns to the DataFrame ```detalle_boletas```, then delete the column ```Fecha```.
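
An alternative to splitting the text on "/" is to parse the column as dates and read the parts through the ```.dt``` accessor. This is a sketch on invented values, assuming the column is already in YYYY/MM/DD format; unlike string splitting, it produces integer columns.

```python
import pandas as pd

# Invented sample; assumes the dates are already clean.
sample = pd.DataFrame({"Fecha": ["2016/01/01", "2016/01/02", "2016/12/31"]})

# An explicit format makes malformed dates raise an error instead of being guessed.
parsed = pd.to_datetime(sample["Fecha"], format="%Y/%m/%d")

sample["Anho"] = parsed.dt.year
sample["Mes"] = parsed.dt.month
sample["Dia"] = parsed.dt.day
sample = sample.drop(columns=["Fecha"])
print(sample)
```

Parsing also acts as a validation step: a value such as "2016/13/01" would fail here, while a plain split would accept it silently.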


In [None]:
separated_date = detalle_boletas['Fecha'].str.split('/', expand=True)
print(separated_date.head())


separated_date.columns = ['Anho', 'Mes', 'Dia']
detalle_boletas = detalle_boletas.join(separated_date)
print(detalle_boletas.head())

detalle_boletas = detalle_boletas.drop(columns=['Fecha'])
print(detalle_boletas.head())

## Data integration

### 6. Load Products

Load the file ```Lista productos.csv``` as a data frame and name it ```lista_productos```.

This file contains the detail of the 10 products available in stock.

![alt text](../images/products.png)

Where:
- ```ID```: Identifier of each product
- ```Nombre```: Product's name
- ```Descrip```: Product's description
- ```Precio Unitario```: Product's unit price

In [None]:
input_file_productos = '../data/Lista productos.csv'

lista_productos = pd.read_csv(input_file_productos, sep=',', encoding='utf-8')
print(lista_productos.head(10))

print(lista_productos.dtypes)


### 7. Merge dataframes

Join the DataFrame ```lista_productos``` with the DataFrame ```detalle_boletas```, based on the information in the ```ID``` column.

The resulting DataFrame from this join must contain the same information as the DataFrame ```detalle_boletas```, but now each row must also include the product name, the description, and the unit price.

You must call this DataFrame ```detalle_boletas2```. Print this DataFrame to the console.

Pay attention to the data type, because in order to perform this join, the column used to match values in both DataFrames must have the same type.
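
The type issue can be illustrated on a small scale. In this sketch the two toy DataFrames, the placeholder names and the placeholder prices are invented; ```validate``` and ```indicator``` are optional ```merge``` parameters used here to double-check the join, not something the exercise requires.

```python
import pandas as pd

boletas = pd.DataFrame({
    "Num Boleta": ["554170000002", "554170000002", "554170000003"],
    "ID": ["400009", "400007", "400005"],    # stored as strings
    "Cantidad": [3, 2, 2],
})
productos = pd.DataFrame({
    "ID": [400005, 400007, 400009],           # stored as integers
    "Nombre": ["A", "B", "C"],                # placeholder names
    "Precio Unitario": [1000, 1500, 2000],    # placeholder prices
})

# Align the key type before merging; string and integer keys do not match.
productos["ID"] = productos["ID"].astype(str)

merged = boletas.merge(
    productos,
    on="ID",
    how="left",               # keep every receipt row
    validate="many_to_one",   # raise if a product ID is repeated in productos
    indicator=True,           # add a _merge column describing each row's match
)

# Rows whose product was not found in the product list.
unmatched = merged[merged["_merge"] != "both"]
print(len(merged), len(unmatched))
```

A left join keeps the row count of the receipts DataFrame, so comparing the lengths before and after the merge is a quick sanity check.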

In [None]:
lista_productos['ID'] = lista_productos['ID'].astype(str)

detalle_boletas2 = detalle_boletas.merge(lista_productos, on='ID', how='left')
print(detalle_boletas2.head(20))

### 8. Total Revenue

Calculate how much revenue each receipt (boleta) generated from the sale of products.
To do this, add a new column named ```Ingreso total``` to the DataFrame ```detalle_boletas2```.
This column must contain the values resulting from multiplying the ```Precio Unitario``` column by the ```Cantidad``` column.
Print the DataFrame ```detalle_boletas2``` with this new column to the console.
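
Because a receipt can span several rows, the new column holds the revenue of each row (one product within one receipt). If a single figure per receipt is wanted, the rows can be summed by ```Num Boleta```. This goes slightly beyond what the exercise asks; the sketch uses invented prices and the name ```ingreso_por_boleta``` is hypothetical.

```python
import pandas as pd

sample = pd.DataFrame({
    "Num Boleta": ["554170000002", "554170000002", "554170000003"],
    "Cantidad": [3, 2, 2],
    "Precio Unitario": [2000, 1500, 1000],   # placeholder prices
})

# Revenue per row: units sold times unit price.
sample["Ingreso total"] = sample["Cantidad"] * sample["Precio Unitario"]

# Revenue per receipt: sum of its rows.
ingreso_por_boleta = sample.groupby("Num Boleta")["Ingreso total"].sum()
print(ingreso_por_boleta)
```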

In [None]:
# Revenue per row: units sold times the product's unit price
detalle_boletas2['Ingreso total'] = detalle_boletas2['Cantidad'] * detalle_boletas2['Precio Unitario']
print(detalle_boletas2.head())

### 9. Descriptive statistics Total Revenue

Finally, calculate descriptive statistics of the column ```Ingreso total``` for each of the products that exist.
The descriptive statistics you must calculate are: mean (media), standard deviation (desviación estándar), minimum (mínimo), and maximum (máximo).
Your result should look like this:

![Descriptive statistics example 2](../images/descriptive_statistic_example_2.png)

In [None]:
# A list of aggregation names keeps a stable column order (a set has no guaranteed order)
descriptive_statistic = detalle_boletas2.pivot_table(index=['ID'], values=['Ingreso total'], aggfunc=['mean', 'std', 'min', 'max', 'sum'])
print(descriptive_statistic.head(10))