# Exercise: Data Cleaning and Analysis

In practice, data often comes labeled with codes or extreme abbreviations like "Schw_Tr_d_Le_en_W" instead of descriptive column names. Entries are often missing or erroneous, which can introduce errors into machine learning models. Data cleaning fixes erroneous entries and ensures the integrity of the dataset, but it does _not_ involve transforming the data to prepare it for an algorithm, e.g. via scaling. The exact steps of a data cleaning process depend on the data at hand, but often include making the data human-readable, removing false or incomplete data points, fixing corrupt entries, and removing duplicates.

In [None]:
# For this exercise, only use pandas
import pandas as pd

##### 1. Load "raw_data.csv" into a dataframe and rename all columns to match _Description_ from Table 1.
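
One possible approach, sketched on a tiny in-memory stand-in for "raw_data.csv": read the file with `pd.read_csv` and pass a mapping from raw codes to descriptive names to `DataFrame.rename`. The raw codes and descriptions below are hypothetical placeholders; the real ones come from Table 1.

```python
import io

import pandas as pd

# Stand-in for "raw_data.csv"; codes and values are hypothetical, not from Table 1.
raw_csv = io.StringIO("c_id,o_dt,it_sz\n1,2016-06-22,m\n2,2016-06-23,l\n")
df = pd.read_csv(raw_csv)

# Map each raw column code to its descriptive name (hypothetical mapping).
column_names = {"c_id": "customer_id", "o_dt": "order_date", "it_sz": "size"}
df = df.rename(columns=column_names)

print(df.columns.tolist())
```

With the real file, `raw_csv` would be replaced by the path "raw_data.csv" and the mapping by the full list from Table 1.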

##### 2. Correct the data types for all _nominal_ attributes and assign the corresponding labels that are specified under _Comment_ in Table 1.
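
A minimal sketch of one way to handle a nominal attribute: map the stored codes to their labels and convert the column to the unordered `category` dtype. The column name "salutation" and its code-to-label mapping are assumptions for illustration; the actual attributes and labels are listed in Table 1.

```python
import pandas as pd

# Hypothetical nominal attribute stored as integer codes.
df = pd.DataFrame({"salutation": [2, 3, 2, 4]})

# Hypothetical code-to-label mapping (the real one is given under Comment in Table 1).
labels = {2: "Company", 3: "Mr.", 4: "Mrs."}

# Replace codes by labels, then store the result as an unordered categorical.
df["salutation"] = df["salutation"].map(labels).astype("category")

print(df["salutation"].cat.categories.tolist())
```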

##### 3. Correct the data type of the _ordinal_ attribute "size" and assign the corresponding labels specified under _Comment_ in Table 1.
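
For an ordinal attribute, the order of the categories matters, so an ordered categorical is a natural fit. The sketch below assumes hypothetical size labels; the actual labels and their order should be taken from Table 1.

```python
import pandas as pd

df = pd.DataFrame({"size": ["m", "xl", "s", "l", "m"]})

# Hypothetical label order, smallest to largest.
size_order = ["s", "m", "l", "xl"]
size_dtype = pd.CategoricalDtype(categories=size_order, ordered=True)
df["size"] = df["size"].astype(size_dtype)

# Ordered categoricals support comparisons and min/max.
larger_than_m = df[df["size"] > "m"]
print(larger_than_m)
```

Because the dtype is ordered, sorting by "size" follows the declared order rather than alphabetical order.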

##### 4. Correct the data types for all _date_ attributes. Split "order_date" into separate columns for "weekday", "year", "month", "day" and "quarter".
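
A short sketch of the date handling, using made-up sample values: `pd.to_datetime` converts the strings, and the `.dt` accessor exposes the individual components. Passing `errors="coerce"` is one option for turning unparseable entries into NaT instead of raising an error.

```python
import pandas as pd

# Made-up sample values; the last entry is deliberately invalid.
df = pd.DataFrame({"order_date": ["2016-06-22", "2016-12-01", "not a date"]})

# Unparseable strings become NaT instead of raising an exception.
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")

# Derive the requested components; rows with NaT get missing values here.
df["weekday"] = df["order_date"].dt.day_name()
df["year"] = df["order_date"].dt.year
df["month"] = df["order_date"].dt.month
df["day"] = df["order_date"].dt.day
df["quarter"] = df["order_date"].dt.quarter

print(df)
```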

##### 5. Find missing values (NaN, NaT, None) and remove or fill these entries (e.g. with the mean).

To deal with missing values adequately, it is important to understand what type of data is at hand, and why it is missing. For example, if the date of birth of a customer is not specified, the data point might still contain valuable information about the customer's orders, and it would be a waste to remove the complete data point. In such cases, it can make sense to keep the value as NaN or introduce a default value which makes it apparent that this value was missing.
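
The two strategies described above can be sketched as follows, assuming hypothetical columns "price" and "date_of_birth": a numeric gap is filled with the column mean, while the missing date is kept as NaT and made explicit through an indicator column, so no row is dropped.

```python
import pandas as pd

# Hypothetical columns with one missing value each.
df = pd.DataFrame({
    "price": [10.0, None, 30.0],
    "date_of_birth": pd.to_datetime(["1990-05-01", None, "1985-11-23"]),
})

# Count missing values per column (NaN, NaT, and None are all detected).
print(df.isna().sum())

# Keep the row, but record that the date of birth was missing.
df["dob_missing"] = df["date_of_birth"].isna()

# Fill the numeric gap with the column mean.
df["price"] = df["price"].fillna(df["price"].mean())

print(df)
```

Whether the mean is a sensible fill value depends on the attribute; for skewed data, the median may be a better choice.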

### Now that the data is essentially clean, perform some basic analysis on it.

##### 6. Create a new column "delivery_time_days" as the difference between "delivery_date" and "order_date" in days. Inspect the created column for errors and label erroneous entries accordingly.
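
A minimal sketch on made-up dates: subtracting two datetime columns yields a timedelta column, and `.dt.days` turns it into a day count. Treating negative durations as erroneous and masking them as missing is an assumption about what counts as an error here; the flag column name is likewise hypothetical.

```python
import pandas as pd

# Made-up dates: row 1 is delivered before it is ordered, row 2 has no delivery date.
df = pd.DataFrame({
    "order_date": pd.to_datetime(["2016-06-22", "2016-06-23", "2016-06-25"]),
    "delivery_date": pd.to_datetime(["2016-06-27", "2016-06-20", None]),
})

# Difference as a timedelta, then as a number of days.
df["delivery_time"] = df["delivery_date"] - df["order_date"]
df["delivery_time_days"] = df["delivery_time"].dt.days

# Assumption: a delivery before the order date is an error.
invalid = df["delivery_time_days"] < 0
df["delivery_time_invalid"] = invalid  # hypothetical flag column
df["delivery_time_days"] = df["delivery_time_days"].mask(invalid)

print(df[["delivery_time_days", "delivery_time_invalid"]])
```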

##### 7. Plot a histogram for the new "delivery_time_days" column. Then discretize its values into the bins "NaN", "<=5d", and ">5d" and store these in a new column "delivery_time_days_discrete". Plot a bar chart for "delivery_time_days_discrete".
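
One way to sketch the discretization, using made-up day counts: `pd.cut` with open-ended bin edges assigns "<=5d" and ">5d", and missing values are then given their own "NaN" category so that they show up in the counts.

```python
import pandas as pd

# Made-up day counts, including one missing value.
df = pd.DataFrame({"delivery_time_days": [2.0, 5.0, 9.0, None]})

# Intervals are right-closed by default: (-inf, 5] and (5, inf).
binned = pd.cut(
    df["delivery_time_days"],
    bins=[float("-inf"), 5, float("inf")],
    labels=["<=5d", ">5d"],
)

# Add an explicit "NaN" category and assign it to the missing entries.
binned = binned.cat.add_categories(["NaN"]).fillna("NaN")
df["delivery_time_days_discrete"] = binned

counts = df["delivery_time_days_discrete"].value_counts()
print(counts)
```

If matplotlib is installed, `df["delivery_time_days"].plot.hist()` and `counts.plot.bar()` are the pandas plotting calls that would produce the histogram and the bar chart.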

##### 8. Compute the correlation matrix for the numerical attributes. Plot the scatterplot matrix and a heatmap of the correlation matrix.
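
A small sketch of the correlation step on made-up data: the numerical columns are selected first, then `DataFrame.corr` computes the pairwise Pearson correlations. The column names are hypothetical.

```python
import pandas as pd

# Made-up data with two numerical columns and one non-numerical column.
df = pd.DataFrame({
    "price": [10.0, 20.0, 30.0, 40.0],
    "quantity": [4, 3, 2, 1],
    "size": ["s", "m", "l", "xl"],
})

# Restrict to numerical attributes before computing the correlation matrix.
numeric = df.select_dtypes(include="number")
corr = numeric.corr()

print(corr)
```

For the plots, `pd.plotting.scatter_matrix(numeric)` draws the scatterplot matrix, and `corr.style.background_gradient()` is one pandas-only option for rendering the correlation matrix as a heatmap in a notebook; both rely on matplotlib being installed.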