The next cell downloads the data set required for the workshop. These shell commands are intended for Google Colab; if you run the notebook on another platform, you will need to download the `data` directory manually from the GitHub repository and place it in the same folder as the notebook file.

In [None]:
!wget -O main.zip https://github.com/jandion/APSV-BusinessIntelligence/archive/refs/heads/main.zip
!unzip main.zip
!rm -rf data
!mv APSV-BusinessIntelligence-main/data .
!rm -r APSV-BusinessIntelligence-main main.zip

In this workshop, we will work with a data set obtained from a real logistics process. The data contain the events of a process in which goods (items) are transported from one station to another by train. Trains move wagons, and each wagon can carry several items. For each event, three timestamps are recorded:
1. The time at which the event was planned.
2. The time at which the event was estimated to occur.
3. The time at which the event actually occurred.

The fields in the data set have been anonymized, except for the times and descriptions of the events. The types of events appear in Spanish; the following table is a translation of these into English:

Spanish | English
---|---
'EXPEDICION DE VAGONES'          | 'EXPEDITION OF WAGONS'
'FIN DE CARGA DE VAGONES'        | 'END OF WAGON LOADING'
'FIN DE DESCARGA DE VAGONES'     | 'END OF WAGON UNLOADING'
'LLEGADA A DESTINO DE VAGONES'   | 'ARRIVAL AT DESTINATION OF WAGONS'
'LLEGADA DE MERCANCIAS'          | 'ARRIVAL OF ITEMS'
'LLEGADA DE VAGONES'             | 'ARRIVAL OF WAGONS'
'SALIDA DE VAGONES'              | 'DEPARTURE OF WAGONS'

Our goal is to obtain as much information as we can from this data set. We will do this in two ways: by answering questions with numeric values (e.g. how many trains take part in the process?) and by generating charts that present information visually (e.g. how are the transported items distributed throughout the year?).


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
data = pd.read_csv("data/trains.zip")

# Dataset exploration
Any project related to data analysis starts with a study of the data itself. 
* What kind of data do we have?
* How is it organized? How many columns does each dataframe have?
* Are there wrong or missing values?
* What does each value of a column mean? How many different values are in each column?

Using the examples we saw in the previous workshop, we will try to answer these questions.
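As a minimal sketch of how the questions above map onto pandas calls, the following block uses a small made-up frame rather than the workshop data. Only `item_id` and `event_location` are column names that appear in this notebook; `event_type` is a hypothetical name chosen for illustration.

```python
import pandas as pd

# Toy frame for illustration only. `event_type` is a hypothetical column name;
# check the real names with data.columns before adapting this.
toy = pd.DataFrame({
    "item_id": [1, 1, 2, 3, 3],
    "event_type": ["SALIDA DE VAGONES", "LLEGADA DE VAGONES",
                   "SALIDA DE VAGONES", None, "LLEGADA DE VAGONES"],
    "event_location": ["LOCATION 19", "LOCATION 4", "LOCATION 19",
                       "LOCATION 4", None],
})

# What kind of data do we have? -> type of each column
print(toy.dtypes)

# Are there missing values? -> number of nulls per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# How many different values are in each column? (nulls are not counted)
distinct_per_column = toy.nunique()
print(distinct_per_column)

# Which values does a column take, and how often? (nulls included)
event_type_counts = toy["event_type"].value_counts(dropna=False)
print(event_type_counts)
```

The same calls can be applied to `data` once the real column names are known.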

In [None]:
data.event_location.describe()

In [None]:
data.info(show_counts=True)

In [None]:
data.describe()

# Data Cleaning
After a first look at the data, and before starting to work with it, we must clean it. This process is called preprocessing, and it is crucial for obtaining good results. We will discard invalid data, fill in missing values, drop redundant information, correct typos, etc. We need to create a data set that satisfies the following restrictions:

* All columns must contain relevant information.
* Rows with null or incorrect values should be discarded unless these values can be retrieved in some way (https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html, https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html and https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html)
* The types of the columns must correspond to the type of data they contain (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.astype.html)
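The three restrictions above could be applied roughly as follows. This is a sketch on a made-up frame, not the workshop's solution: `constant_flag` is a hypothetical column, and dropping single-valued columns is just one possible reading of "relevant information".

```python
import pandas as pd

# Toy frame for illustration only; `constant_flag` is a hypothetical column.
toy = pd.DataFrame({
    "item_id": ["1", "2", None, "4"],
    "planned_date": ["2021-03-01 08:00", "2021-03-01 09:30",
                     "2021-03-02 10:00", "not a date"],
    "constant_flag": ["X", "X", "X", "X"],
})

# 1. Relevant columns only: here we assume a column holding a single
#    value carries no information and can be dropped.
single_valued = [c for c in toy.columns if toy[c].nunique(dropna=False) <= 1]
clean = toy.drop(columns=single_valued)

# 2. Incorrect values: unparseable dates become NaT instead of raising.
clean["planned_date"] = pd.to_datetime(
    clean["planned_date"], format="%Y-%m-%d %H:%M", errors="coerce"
)

#    Rows whose key values are missing and cannot be recovered are discarded.
clean = clean.dropna(subset=["item_id", "planned_date"])

# 3. Column types must match the content: identifiers stored as text
#    are converted to integers.
clean["item_id"] = clean["item_id"].astype(int)

print(clean)
print(clean.dtypes)
```

Whether a missing value should be dropped or filled (for example with `fillna`) depends on the column, so each case deserves a separate decision.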

Python and Pandas have a multitude of methods that make it easier to work with dates (https://docs.python.org/3/library/datetime.html). In this workshop, the most useful ones are those that extract a fragment of a date (the hour of the day, the day of the week, etc.). An effective way to apply such an operation to an entire column is the `.dt` accessor: for example, `data.planned_date.dt.hour` returns the hour of every date in the column `planned_date`. Note that the column must have a datetime type for `.dt` to work.
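A short sketch of the `.dt` accessor on a made-up frame follows. `planned_date` is the column named in this notebook; `actual_date` is a hypothetical name for the column holding the time at which the event actually occurred.

```python
import pandas as pd

# Toy frame for illustration only; `actual_date` is a hypothetical column name.
toy = pd.DataFrame({
    "planned_date": pd.to_datetime(["2021-03-01 08:15", "2021-03-06 17:40"]),
    "actual_date": pd.to_datetime(["2021-03-01 08:45", "2021-03-06 17:30"]),
})

# Fragments of a date, computed for the whole column at once
toy["planned_hour"] = toy.planned_date.dt.hour
toy["planned_weekday"] = toy.planned_date.dt.dayofweek   # Monday is 0
toy["planned_day_name"] = toy.planned_date.dt.day_name()

# Subtracting two datetime columns gives timedeltas, which also have a
# .dt accessor; a negative delay means the event happened early.
toy["delay_minutes"] = (toy.actual_date - toy.planned_date).dt.total_seconds() / 60

print(toy)
```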

# Data Analytics

Once we understand the data we are using and have cleaned it, we need to ask what useful information we can extract from it. It is a good idea to brainstorm possible questions, sort the resulting list by difficulty, and then answer the questions starting with the easiest ones.

## Basic results

Some of the easiest questions or more basic results we can obtain are the following:
* Numeric results
    * How many packages are there?
    * How many trains are there?
    * How many wagons are there?
    * How many stations are there?
    * How many routes are there? (combination of origin and destination)
    * Which is the most used train?
    * How many types of events are there?
    * What is the average delay of each type of event?
    * How many packages are processed at each of the stations?

* Graphic results
    * Bar chart of the number of packages processed at each station
    * Histogram of the delay in minutes
    * Distributions of the number of packages handled according to dates (all three), i.e. the number of unique item_id per day.

By answering these questions, you may discover some inconsistencies or errors in the data that you did not detect in the cleaning phase. This is usual in this type of task; you can return to the preprocessing phase with this new knowledge and refine the cleaning of the data.

Some of those questions only require counts, sums, or rankings, but others need data aggregations with `groupby`.
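The difference between a plain count and a `groupby` aggregation can be sketched on a made-up frame. Packages are counted as unique `item_id` values, as described above; `event_type` and `delay_minutes` are hypothetical column names (the delay would have to be computed from the date columns first).

```python
import pandas as pd

# Toy frame for illustration only; `event_type` and `delay_minutes`
# are hypothetical column names.
toy = pd.DataFrame({
    "item_id": [1, 1, 2, 2, 3],
    "event_location": ["LOCATION 19", "LOCATION 4", "LOCATION 19",
                       "LOCATION 4", "LOCATION 19"],
    "event_type": ["SALIDA DE VAGONES", "LLEGADA DE VAGONES",
                   "SALIDA DE VAGONES", "LLEGADA DE VAGONES",
                   "SALIDA DE VAGONES"],
    "delay_minutes": [10.0, 30.0, 0.0, 20.0, 5.0],
})

# A plain count: how many packages are there?
n_packages = toy["item_id"].nunique()

# An aggregation: how many distinct packages pass through each location?
packages_per_location = (
    toy.groupby("event_location")["item_id"]
       .nunique()
       .sort_values(ascending=False)
)

# Another aggregation: average delay of each type of event
mean_delay_per_event = toy.groupby("event_type")["delay_minutes"].mean()

print(n_packages)
print(packages_per_location)
print(mean_delay_per_event)
```

A Series such as `packages_per_location` can be drawn directly as a bar chart with `packages_per_location.plot(kind="bar")`.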

**Answer each question in a different cell of this notebook (remember that you can insert a new cell from the Insert menu).**


## Intermediate results

* Numeric results
    * How many different routes does each train cover?
    * What are the average, max, and min durations of the complete process of a package? (difference between the first and last date registered for an item)
    * On which day of the week are the most shipments made?
    * In which months is each train active? And on which days of the week?
    * How long does it take on average to load a wagon?
    * How many packages does each train carry on each journey? And how many wagons?
    * On which stations does each client operate?

* Graphic results
    * Distribution of the delay in minutes of a given event type, by train
    * Number of packages (cumulative) that each station has processed over time
    * Distribution of delays according to the type of event

**Answer at least 5 of these questions**
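As one example of the pattern behind several of these questions, the duration of the complete process of a package can be sketched as a two-step aggregation: first reduce each item to its first and last date, then summarize the differences. The frame is made up, and `actual_date` is a hypothetical column name.

```python
import pandas as pd

# Toy frame for illustration only; `actual_date` is a hypothetical column name.
toy = pd.DataFrame({
    "item_id": [1, 1, 1, 2, 2],
    "actual_date": pd.to_datetime([
        "2021-03-01 08:00", "2021-03-01 12:00", "2021-03-02 08:00",
        "2021-03-03 09:00", "2021-03-03 15:00",
    ]),
})

# Step 1: first and last registered date of each item
first_last = toy.groupby("item_id")["actual_date"].agg(["min", "max"])

# Step 2: duration per item, then summary statistics over all items
duration = first_last["max"] - first_last["min"]
mean_duration = duration.mean()
max_duration = duration.max()
min_duration = duration.min()

print(duration)
print(mean_duration, max_duration, min_duration)
```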

## Advanced results

* Identification of stations
    The origins and destinations that appear in the data correspond to platforms of the different stations. For example, LOCATION 19 is a platform of STATION 8. The task consists of creating two new columns, `origin_station` and `destination_station`, by identifying which locations belong to each station.

    Tip: This is an exploration task that can be done in several ways. One way to start is with the meaning of the events: the event "LLEGADA A DESTINO DE VAGONES" is always recorded at the destination station of an item.


* Processes: 
    The life cycle of each package can be viewed as a sequence of states representing the logistics process.
    * How many different process sequences appear in the data? Which of them are correct?
    * Among those that are correct, which are the most frequent?
    * How could you correct the wrong ones?

**Optional**
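One possible way to start on the process question is to build, for every item, the chronologically ordered sequence of its event types and then count how often each sequence occurs. The sketch below uses a made-up frame; `event_type` and `actual_date` are hypothetical column names, and the sequences shown are not meant to indicate which real sequences are correct.

```python
import pandas as pd

# Toy frame for illustration only; `event_type` and `actual_date` are
# hypothetical column names. The rows of item 2 are deliberately unordered.
toy = pd.DataFrame({
    "item_id": [1, 1, 1, 2, 2, 2, 3, 3],
    "event_type": [
        "FIN DE CARGA DE VAGONES", "SALIDA DE VAGONES", "LLEGADA DE VAGONES",
        "SALIDA DE VAGONES", "FIN DE CARGA DE VAGONES", "LLEGADA DE VAGONES",
        "SALIDA DE VAGONES", "LLEGADA DE VAGONES",
    ],
    "actual_date": pd.to_datetime([
        "2021-03-01 08:00", "2021-03-01 09:00", "2021-03-01 12:00",
        "2021-03-01 09:30", "2021-03-01 08:30", "2021-03-01 13:00",
        "2021-03-01 10:00", "2021-03-01 14:00",
    ]),
})

# Order the events of each item in time, then join the event types
# of each item into a single string that represents its sequence.
sequences = (
    toy.sort_values(["item_id", "actual_date"])
       .groupby("item_id")["event_type"]
       .agg(" > ".join)
)

# How many different sequences are there, and how frequent is each one?
n_sequences = sequences.nunique()
sequence_counts = sequences.value_counts()

print(sequences)
print(sequence_counts)
```

Comparing the less frequent sequences against the most frequent ones is a reasonable first step toward deciding which are wrong and how they could be repaired.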