# Data preparation

## Setup

In [19]:
import pandas as pd

## Data import

In [20]:
URL = 'https://raw.githubusercontent.com/kirenz/datasets/master/3_4_data.csv'
df = pd.read_csv(URL)

## Inspect data

In [21]:
df

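|   | Month | Date | Direct Sales | Indirect Sales | Goal |
|---|-------|------|--------------|----------------|------|
| 0 | Jan | 2019-01-01 | 88.2 | 82.2 | 90 |
| 1 | Feb | 2019-02-01 | 76.3 | 71.4 | 90 |
| 2 | Mar | 2019-03-01 | 47.8 | 88.7 | 90 |
| 3 | Apr | 2019-04-01 | 76.1 | 81.0 | 90 |
| 4 | May | 2019-05-01 | 71.4 | 88.4 | 90 |
| 5 | Jun | 2019-06-01 | 58.6 | 120.2 | 90 |
| 6 | Jul | 2019-07-01 | 79.9 | 83.5 | 90 |
| 7 | Aug | 2019-08-01 | 69.4 | 73.8 | 90 |
| 8 | Sep | 2019-09-01 | 53.9 | 98.0 | 90 |
| 9 | Oct | 2019-10-01 | 80.8 | 85.1 | 90 |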


In [22]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   Month           12 non-null     object        
 1   Date            12 non-null     datetime64[ns]
 2   Direct Sales    12 non-null     float64       
 3   Indirect Sales  12 non-null     float64       
 4   Goal            12 non-null     int64         
dtypes: datetime64[ns](1), float64(2), int64(1), object(1)
memory usage: 612.0+ bytes


## Data preparation

### Wide vs long format

- To draw a line chart with one line per category in Altair, the data should be in "long" format rather than "wide" format.

- In long format, each row holds a single observation: one sales value for one month and one sales type.

- Our data is currently in wide format: "Direct Sales" and "Indirect Sales" are separate columns.

- The pandas [melt function](https://pandas.pydata.org/docs/reference/api/pandas.melt.html) reshapes it into the desired long format:

In [23]:
# Melt the data to long format: one row per month and sales type
df = df.melt(
    id_vars=["Month", "Date", "Goal"],
    value_vars=["Direct Sales", "Indirect Sales"],
    var_name="Sales Type",
    value_name="Sales",
)

In [24]:
df


|   | Month | Date | Goal | Sales Type | Sales |
|---|-------|------|------|------------|-------|
| 0 | Jan | 2019-01-01 | 90 | Direct Sales | 88.2 |
| 1 | Feb | 2019-02-01 | 90 | Direct Sales | 76.3 |
| 2 | Mar | 2019-03-01 | 90 | Direct Sales | 47.8 |
| 3 | Apr | 2019-04-01 | 90 | Direct Sales | 76.1 |
| 4 | May | 2019-05-01 | 90 | Direct Sales | 71.4 |
| 5 | Jun | 2019-06-01 | 90 | Direct Sales | 58.6 |
| 6 | Jul | 2019-07-01 | 90 | Direct Sales | 79.9 |
| 7 | Aug | 2019-08-01 | 90 | Direct Sales | 69.4 |
| 8 | Sep | 2019-09-01 | 90 | Direct Sales | 53.9 |
| 9 | Oct | 2019-10-01 | 90 | Direct Sales | 80.8 |
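
To make the reshaping concrete, here is a minimal sketch on a two-month excerpt of the sales data. The names `wide` and `long` are illustrative and do not appear in the notebook; the excerpt drops the Date and Goal columns for brevity.

```python
import pandas as pd

# Illustrative two-month excerpt in wide format: one column per sales type
wide = pd.DataFrame({
    "Month": ["Jan", "Feb"],
    "Direct Sales": [88.2, 76.3],
    "Indirect Sales": [82.2, 71.4],
})

# melt stacks the two value columns into a single "Sales" column and
# records the originating column name in "Sales Type"
long = wide.melt(
    id_vars=["Month"],
    value_vars=["Direct Sales", "Indirect Sales"],
    var_name="Sales Type",
    value_name="Sales",
)

print(wide.shape)  # (2, 3): one row per month
print(long.shape)  # (4, 3): one row per month and sales type
```

Each wide row becomes one long row per value column, which is why the full data set grows from 12 to 24 rows after melting.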


### Create new variable

In [25]:
# Create a column called Text holding the string 'GOAL' to use as a label in our plot
df['Text'] = 'GOAL'

In [26]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24 entries, 0 to 23
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   Month       24 non-null     object        
 1   Date        24 non-null     datetime64[ns]
 2   Goal        24 non-null     int64         
 3   Sales Type  24 non-null     object        
 4   Sales       24 non-null     float64       
 5   Text        24 non-null     object        
dtypes: datetime64[ns](1), float64(1), int64(1), object(3)
memory usage: 1.3+ KB


### Change data types

In [27]:
# Columns to store as pandas categoricals
LIST_CAT = ['Month', 'Sales Type']

for col in LIST_CAT:
    df[col] = df[col].astype('category')

In [28]:
# Make sure Date is stored as datetime64
df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d')

In [29]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24 entries, 0 to 23
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   Month       24 non-null     category      
 1   Date        24 non-null     datetime64[ns]
 2   Goal        24 non-null     int64         
 3   Sales Type  24 non-null     category      
 4   Sales       24 non-null     float64       
 5   Text        24 non-null     object        
dtypes: category(2), datetime64[ns](1), float64(1), int64(1), object(1)
memory usage: 1.4+ KB
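
One thing to keep in mind with `astype('category')`: the resulting categories are unordered and stored in lexical order, so sorting by Month gives alphabetical rather than calendar order. If calendar order matters for a later step, an ordered categorical is one option. The sketch below is an assumption about what might be wanted, not part of the original notebook; `MONTH_ORDER` and `months` are hypothetical names.

```python
import pandas as pd

# Hypothetical helper list: month abbreviations in calendar order
MONTH_ORDER = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

months = pd.Series(["Mar", "Jan", "Feb"])

# Plain conversion: categories are sorted lexically
plain = months.astype("category")

# Ordered conversion: categories follow the calendar
month_dtype = pd.CategoricalDtype(categories=MONTH_ORDER, ordered=True)
ordered = months.astype(month_dtype)

print(plain.sort_values().tolist())    # ['Feb', 'Jan', 'Mar']
print(ordered.sort_values().tolist())  # ['Jan', 'Feb', 'Mar']
```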


### Save data

In [31]:
# Write the prepared data to CSV without the index column
df.to_csv("3_4_data.csv", index=False)
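
Note that CSV is a plain-text format, so the `category` and `datetime64` dtypes set above are not stored in the file. A minimal sketch of how they could be restored on reload, using an in-memory buffer instead of a file so the example is self-contained; `df_demo`, `buffer`, and `restored` are illustrative names, and the two rows are an excerpt of the data.

```python
import io

import pandas as pd

# Illustrative frame with the same kinds of dtypes as the prepared data
df_demo = pd.DataFrame({
    "Month": pd.Series(["Jan", "Feb"], dtype="category"),
    "Date": pd.to_datetime(["2019-01-01", "2019-02-01"]),
    "Sales": [88.2, 76.3],
})

# Write to an in-memory text buffer (stands in for a CSV file on disk)
buffer = io.StringIO()
df_demo.to_csv(buffer, index=False)
buffer.seek(0)

# Re-declare the dtypes while reading, since the CSV itself carries none
restored = pd.read_csv(
    buffer,
    dtype={"Month": "category"},
    parse_dates=["Date"],
)

print(restored.dtypes)
```

Without the `dtype` and `parse_dates` arguments, both columns would typically come back as plain strings.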