# How are the regions at the company complying with the forecast?

## Introduction

**Business Context.** You work at O-I Glass, Inc. The company wants to know how forecast compliance is evolving during February in the regions where it operates. The following regions need to be analyzed:

1. APAC.
2. EU.
3. Latin America.
4. North America.

Because the firm is quite large, good strategies for meeting the forecast are hard to come by. The company's leaders are asking for insights on compliance so they can devise strategies that deliver good results in all regions.

**Business Problem.** Your team lead asks you to investigate the following: **"How is the compliance in each region?"**

**Analytical Context.** The data you've been given is in the comma-separated values (CSV) format, and comprises shipped quantities and tonnes for each of the aforementioned regions. This case begins with a brief overview of this data, after which you will: (1) learn how to use the Python library ```pandas``` to load the data; (2) use ```pandas``` to transform this data into a form amenable for analysis; and finally (3) use ```pandas``` to analyze the above question and come to a conclusion. As you may have guessed, ```pandas``` is an enormously useful library for data analysis and manipulation.

## Importing packages to aid in data analysis

External libraries (a.k.a. packages) are code bases that contain a variety of pre-written functions and tools. This allows you to perform a variety of complex tasks in Python without having to "reinvent the wheel" and build everything from the ground up. We will use two core packages: ```pandas``` and ```numpy```.

```pandas``` is an external library that provides functionality for data analysis. Pandas specifically offers a variety of data structures and data manipulation methods that allow you to perform complex tasks with simple, one-line commands.

```numpy``` is a package that we will use later in the case that offers numerous mathematical operations. Together, ```pandas``` and ```numpy``` allow you to create a data science workflow within Python.

Let's import both packages using the ```import``` keyword. We will rename ```pandas``` to ```pd``` and ```numpy``` to ```np``` using the ```as``` keyword. This allows us to use the short name abbreviation when we want to reference any function that is inside either package. The abbreviations we chose are standard across the data science industry and should be followed unless there is a very, very good reason not to.

In [1]:
# Import the Pandas package
import pandas as pd

# Import the NumPy package
import numpy as np

Now that these packages are loaded into Python, we can use their contents. Let's first take a look at ```pandas``` as it has a variety of features we will use to load and analyze our forecast data.

## Fundamentals of ```pandas```


At the core of the ```pandas``` library are two fundamental data structures/objects:
1. ```Series```
2. ```DataFrame```

A ```Series``` object stores single-column data along with an **index**. An index is just a way of "numbering" the ```Series``` object. For example, a ```Series``` could hold the tonnes shipped each day, with either the shipment dates or simple integer positions as its index.

A ```DataFrame``` object is a two-dimensional tabular data structure with labeled axes. It is conceptually helpful to think of a DataFrame object as a collection of Series objects. Namely, think of each column in a DataFrame as a single Series object, where each of these Series objects shares a common index -  the index of the DataFrame object.

Below is the syntax for creating a Series object, followed by the syntax for creating a DataFrame object. Note that a DataFrame object can also have a single column; think of this as a DataFrame consisting of a single Series object:

In [2]:
# Create a simple Series object
simple_series = pd.Series(index=[0,1,2,3], name='Volume', data=[1000,2600,1524,98000])
simple_series

0     1000
1     2600
2     1524
3    98000
Name: Volume, dtype: int64

By changing ```pd.Series``` to ```pd.DataFrame```, and adding a columns input list, a DataFrame object can be created:

In [3]:
# Create a simple DataFrame object
simple_df = pd.DataFrame(index=[0,1,2,3], columns=['Volume'], data=[1000,2600,1524,98000])
simple_df

Unnamed: 0,Volume
0,1000
1,2600
2,1524
3,98000


DataFrame objects are more general compared to Series objects. Let's create a two column DataFrame object:

In [4]:
# Create another DataFrame object
another_df = pd.DataFrame(index=[0,1,2,3], columns=['Date','Volume'], data=[[20190101,1000],[20190102,2600],[20190103,1524],[20190104,98000]])
another_df

Unnamed: 0,Date,Volume
0,20190101,1000
1,20190102,2600
2,20190103,1524
3,20190104,98000


Notice how a list of lists was used to specify the data in the ```another_df``` DataFrame. Each element of the list corresponds to a row in the DataFrame, so the list has 4 elements because there are 4 indices. Each element of the list of lists has 2 elements because the DataFrame has two columns.
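
To make the "collection of Series" picture concrete, the sketch below builds the same two-column table from a dictionary of columns (an alternative constructor form, not the one used above) and then pulls one column back out. The names ```dict_df``` and ```volume_series``` are made up for illustration.

```python
import pandas as pd

# Build the same table from a dict that maps column name -> column values
dict_df = pd.DataFrame(
    {'Date': [20190101, 20190102, 20190103, 20190104],
     'Volume': [1000, 2600, 1524, 98000]},
    index=[0, 1, 2, 3],
)

# Selecting a single column returns a Series that shares the DataFrame's index
volume_series = dict_df['Volume']
print(type(volume_series).__name__)
print(volume_series.index.equals(dict_df.index))
```

Whether you pass rows (a list of lists) or columns (a dict) is mostly a matter of how your data is already laid out.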

## Using <code>pandas</code> to analyze forecast data

Recall that we have CSV files that include data for each of the following O-I Glass regions:

1. Asia Pacific.
2. Europe.
3. Latin America.
4. North America.

The available data for each region includes:

1. **Calendar Day:** The shipment date; only February dates of the current year are included
2. **Region:** Region in which the tonnes were shipped
3. **Commercial Forecast Tonnes:** The forecasted value of tonnes to be shipped
4. **Shipped Quantity:** The actual quantity of units shipped
5. **Shipped Tonnes:** The actual number of tonnes shipped

To get a better sense of the available data, let's first take a look at just the data for the Asia Pacific region. You are given a CSV file that contains the region's shipment data, ```apac.csv```. Pandas allows easy loading of CSV files through the function ```pd.read_csv()```:

In [5]:
# Load a file as a DataFrame and assign to df
df = pd.read_csv('data/apac.csv')

The contents of the file ```apac.csv``` are now stored in the DataFrame object ```df```.
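
By default, ```pd.read_csv()``` reads date columns such as ```Calendar Day``` as plain text. The sketch below shows one way to convert them to real dates. It reads three rows in the same layout as ```apac.csv``` from an in-memory string so that it runs on its own, and it assumes the dates are written as month/day/year. The names ```csv_text``` and ```sample_df``` are hypothetical.

```python
import io
import pandas as pd

# Three rows in the same layout as apac.csv, held in memory for illustration
csv_text = """Calendar Day,Commercial Forecast Tonnes,Shipped Quantity,Shipped Tonnes
2/1/2020,0.0,502728,76.1
2/2/2020,0.0,3151926,569.794
2/3/2020,3778.204283,12879518,3030.398
"""

sample_df = pd.read_csv(io.StringIO(csv_text))

# Convert the text dates to datetimes (assumed format: month/day/year)
sample_df['Calendar Day'] = pd.to_datetime(sample_df['Calendar Day'], format='%m/%d/%Y')
print(sample_df.dtypes)
```

The rest of this case keeps ```Calendar Day``` as text, so this conversion is optional.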

There are several common methods and attributes available to take a peek at the data and get a sense of it:

1. ```DataFrame.head()```  -> returns the column names and first 5 rows by default
2. ```DataFrame.tail()```  -> returns the column names and last 5 rows by default
3. ```DataFrame.shape```   -> returns (num_rows, num_columns)
4. ```DataFrame.columns``` -> returns index of columns
5. ```DataFrame.index```   -> returns index of rows

Using ```df.head()``` and ```df.tail()``` we can take a look at the data contents. Unless specified otherwise, Pandas Series and DataFrame objects have indices that start at 0 and increase by one for each row.

In [6]:
# Look at the head of the DataFrame (i.e. the top rows of the DataFrame)
df.head()

Unnamed: 0,Calendar Day,Commercial Forecast Tonnes,Shipped Quantity,Shipped Tonnes
0,2/1/2020,0.0,502728,76.1
1,2/2/2020,0.0,3151926,569.794
2,2/3/2020,3778.204283,12879518,3030.398
3,2/4/2020,3773.625591,17563183,3959.278
4,2/5/2020,3779.526197,16341120,3582.866


In [7]:
# Look at the tail of the DataFrame (i.e. the bottom rows of the DataFrame)
df.tail()

Unnamed: 0,Calendar Day,Commercial Forecast Tonnes,Shipped Quantity,Shipped Tonnes
14,2/15/2020,0.0,4036324,884.432
15,2/16/2020,0.0,4697999,858.992
16,2/17/2020,3779.526197,18806067,4598.8002
17,2/18/2020,3770.926784,19919523,4739.592
18,2/19/2020,3779.526197,13055177,3182.653


Thus, we see there are 19 data entries (each with 4 data points) for APAC. The shape of a DataFrame is accessed using the ```shape``` attribute:

In [8]:
# Determine the shape of the two-dimensional structure, that is (num_rows, num_columns)
df.shape

(19, 4)

It's important to note that ```DataFrame.columns``` and ```DataFrame.index``` return an index object instead of a list. To convert an index to a list for further list manipulation, we use Python's built-in ```list()``` function:

In [9]:
# List of the column names of the DataFrame
list(df.columns)

['Calendar Day',
 'Commercial Forecast Tonnes',
 'Shipped Quantity',
 'Shipped Tonnes']

In [10]:
# List of the index values (row labels) of the DataFrame
list(df.index)[0:10] # only showing first 10 index values to reduce screen output

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

## Creating additional variables relevant to forecast compliance

Oftentimes, the data provided to you will not be sufficient to achieve your goal. You may have to add additional variables or data features to assist you. Recall that our original question concerned the compliance within the different regions where the company operates. Therefore, our DataFrame must have features related to these quantities.

It can be helpful to think about adding columns to DataFrames as adding adjacent columns one-by-one in Excel. Here is an example of how to do it:

In [11]:
# Add a new column named "Region"
df['Region'] = 'APAC'
df.head()

Unnamed: 0,Calendar Day,Commercial Forecast Tonnes,Shipped Quantity,Shipped Tonnes,Region
0,2/1/2020,0.0,502728,76.1,APAC
1,2/2/2020,0.0,3151926,569.794,APAC
2,2/3/2020,3778.204283,12879518,3030.398,APAC
3,2/4/2020,3773.625591,17563183,3959.278,APAC
4,2/5/2020,3779.526197,16341120,3582.866,APAC


In [12]:
# We can access a column by using [] brackets and the column name
df['Shipped Tonnes'].head() # added .head() to shorten the output

0      76.100
1     569.794
2    3030.398
3    3959.278
4    3582.866
Name: Shipped Tonnes, dtype: float64

In [13]:
# Add a new column named "Quantity_Thousands", which is calculated from the Shipped Quantity column currently in df
df['Quantity_Thousands'] = df['Shipped Quantity'] / 1000.0 # divide every row in df['Shipped Quantity'] by 1 thousand, store in new column
df.head()

Unnamed: 0,Calendar Day,Commercial Forecast Tonnes,Shipped Quantity,Shipped Tonnes,Region,Quantity_Thousands
0,2/1/2020,0.0,502728,76.1,APAC,502.728
1,2/2/2020,0.0,3151926,569.794,APAC,3151.926
2,2/3/2020,3778.204283,12879518,3030.398,APAC,12879.518
3,2/4/2020,3773.625591,17563183,3959.278,APAC,17563.183
4,2/5/2020,3779.526197,16341120,3582.866,APAC,16341.12


In [14]:
# Take a look at the updated DataFrame shape. Two new columns have been added.
df.shape

(19, 6)

## Exercise 1

As discussed, we need to have a feature in our DataFrame that is related to compliance. Because this currently does not exist, we must create it from the already available features. Recall that compliance is the ratio of ```Shipped Tonnes``` to ```Commercial Forecast Tonnes```. Save the results in a new column called *Compliance* and print the head of the DataFrame:

In [15]:
#possible solution
df['Compliance'] = (df['Shipped Tonnes'] / df['Commercial Forecast Tonnes'])
df.head()

Unnamed: 0,Calendar Day,Commercial Forecast Tonnes,Shipped Quantity,Shipped Tonnes,Region,Quantity_Thousands,Compliance
0,2/1/2020,0.0,502728,76.1,APAC,502.728,inf
1,2/2/2020,0.0,3151926,569.794,APAC,3151.926,inf
2,2/3/2020,3778.204283,12879518,3030.398,APAC,12879.518,0.802074
3,2/4/2020,3773.625591,17563183,3959.278,APAC,17563.183,1.049197
4,2/5/2020,3779.526197,16341120,3582.866,APAC,16341.12,0.947967


Here we see the power of ```pandas```. We can simply perform mathematical operations on columns of DataFrames just as if the DataFrames were single variables themselves.
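
The ```inf``` values in the first two rows above appear because the forecast on those days is 0, and dividing a positive number by zero yields infinity. One possible way to handle this, shown as a sketch on three rows copied from the output above, is to treat a zero forecast as missing so that the ratio becomes ```NaN``` instead. Whether such days should be excluded is a business decision and an assumption here; the names ```sample_df``` and ```forecast``` are hypothetical.

```python
import numpy as np
import pandas as pd

sample_df = pd.DataFrame({
    'Commercial Forecast Tonnes': [0.0, 3778.204283, 3773.625591],
    'Shipped Tonnes': [76.1, 3030.398, 3959.278],
})

# Treat a zero forecast as "no forecast" so the ratio becomes NaN instead of inf
forecast = sample_df['Commercial Forecast Tonnes'].replace(0, np.nan)
sample_df['Compliance'] = sample_df['Shipped Tonnes'] / forecast

print(sample_df)
# mean() skips NaN by default, so the average stays finite
print(sample_df['Compliance'].mean())
```

The remainder of this case keeps the ```inf``` values, so the outputs that follow still contain them.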

Now we have features relevant to the original question, and can proceed to the analysis step. A common first step in data analysis is to learn about the distribution of the available data. We will do this next.

## Learning about the data distribution through summary statistics

Let's compute summary statistics, starting with the APAC data currently loaded in ```df```. Fortunately, the DataFrame and Series objects offer a myriad of data summary statistics methods:

1. ```min()```
2. ```median()```
3. ```mean()```
4. ```max()```
5. ```quantile()```

Below, each method is used on the ```Shipped Tonnes``` column. Notice how simple the functions are to apply to the DataFrame. Simply type the name of the DataFrame, followed by a ```.``` and then the method name you'd like to calculate. We've chosen to select a single column ```Shipped Tonnes``` from the DataFrame ```df```, but you could have just as easily called these methods on the full DataFrame rather than a single column:

In [16]:
# Calculate the minimum of the Shipped Tonnes column
df['Shipped Tonnes'].min()

76.1

In [17]:
# Calculate the median of the Shipped Tonnes column
df['Shipped Tonnes'].median()

3030.3979999999997

In [18]:
# Calculate the average of the Shipped Tonnes column
df['Shipped Tonnes'].mean()

2559.627536842105

In [19]:
# Calculate the maximum of the Shipped Tonnes column
df['Shipped Tonnes'].max()

4739.592000000001

We'd also like to explore the data distribution at a more granular level to see how the distribution looks beyond the simple summary statistics presented above. For this, we can use the ```quantile()``` method. The ```quantile()``` method will return the value which represents the given percentile of all the data under study (in this case, of the ```Shipped Tonnes``` data):

In [20]:
# Calculate the 25th percentile
df['Shipped Tonnes'].quantile(0.25)

871.712

In [21]:
# Calculate the 75th percentile
df['Shipped Tonnes'].quantile(0.75)

3684.4965

Is there a more efficient method to quickly compute all of these summary statistics? Yes. One incredibly useful method that combines these summary statistics and also adds a couple others is the ```describe()``` method:

In [22]:
df['Shipped Tonnes'].describe()

count      19.000000
mean     2559.627537
std      1548.040267
min        76.100000
25%       871.712000
50%      3030.398000
75%      3684.496500
max      4739.592000
Name: Shipped Tonnes, dtype: float64

### Exercise 2:

Determine the 25th, 50th, and 75th percentile for the ```Commercial Forecast Tonnes``` and ```Shipped Quantity``` columns of ```df```.

**Answer.** One possible solution is indicated below:

In [23]:
# One possible solution
print(df['Shipped Quantity'].describe())
print(df['Commercial Forecast Tonnes'].describe())

count    1.900000e+01
mean     1.112576e+07
std      6.427885e+06
min      5.027280e+05
25%      4.367162e+06
50%      1.287952e+07
75%      1.636095e+07
max      1.991952e+07
Name: Shipped Quantity, dtype: float64
count      19.000000
mean     2572.864132
std      1796.092453
min         0.000000
25%         0.000000
50%      3770.926784
75%      3779.526197
max      3779.526197
Name: Commercial Forecast Tonnes, dtype: float64
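
If only the three percentiles are needed, ```quantile()``` also accepts a list of quantiles and can be called on several columns at once. The sketch below uses a small made-up DataFrame (```toy_df```, not the shipment data) so the result is easy to check by hand.

```python
import pandas as pd

# Small made-up DataFrame, not the shipment data
toy_df = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [10, 20, 30, 40, 50]})

# One row per requested quantile, one column per selected column
quartiles = toy_df[['a', 'b']].quantile([0.25, 0.5, 0.75])
print(quartiles)
```

The same call on ```df[['Commercial Forecast Tonnes', 'Shipped Quantity']]``` should return the 25%, 50% and 75% rows reported by ```describe()```.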


## Aggregating data from multiple regions

So far, we've only been looking at data from one of our four regions. Let's go ahead and combine all four CSV files to analyze the four regions together. This will also reduce the amount of programming work required since the code will be shared across the four regions.

One way to accomplish this aggregation task is to use the ```pd.concat()``` method from ```pandas```. An input into this method may be a list of DataFrames that you'd like to concatenate. We will use a for loop to loop over each region name, load the corresponding CSV file, and then append the result to a list which is later aggregated using ```pd.concat()```. Let's take a look at how this is done.

In [24]:
# Load four csv files into one dataframe
print("Defining region names")
regions_data_to_load = ['apac','eu','la','na']
list_of_df = []

# Loop over all regions names
print(" --- Start loop over regions --- ")
for i in regions_data_to_load:
    print("Processing Region: " + i)
    temp_df = pd.read_csv('data/'+i+'.csv')
    temp_df['Region'] = i # ADD NEW COLUMN WITH REGION NAME TO DISTINGUISH IN FINAL DATAFRAME
    list_of_df.append(temp_df)

print(" --- Complete loop over regions --- ")
    
# Combine into a single DataFrame by using concat
print("Aggregating Data")
agg_df = pd.concat(list_of_df, axis=0)

# Add salient statistics for Compliance
print('Calculating Salient Features')
agg_df['Compliance'] = (agg_df['Shipped Tonnes'] / agg_df['Commercial Forecast Tonnes'])

print("agg_df DataFrame shape (rows, columns): ")
print(agg_df.shape)

print("Head of agg_df DataFrame: ")
agg_df.head()

Defining region names
 --- Start loop over regions --- 
Processing Region: apac
Processing Region: eu
Processing Region: la
Processing Region: na
 --- Complete loop over regions --- 
Aggregating Data
Calculating Salient Features
agg_df DataFrame shape (rows, columns): 
(76, 6)
Head of agg_df DataFrame: 


Unnamed: 0,Calendar Day,Commercial Forecast Tonnes,Shipped Quantity,Shipped Tonnes,Region,Compliance
0,2/1/2020,0.0,502728.0,76.1,apac,inf
1,2/2/2020,0.0,3151926.0,569.794,apac,inf
2,2/3/2020,3778.204283,12879518.0,3030.398,apac,0.802074
3,2/4/2020,3773.625591,17563183.0,3959.278,apac,1.049197
4,2/5/2020,3779.526197,16341120.0,3582.866,apac,0.947967


After the for loop, we combined the four regional DataFrames and added the ```Compliance``` feature identified in the previous section. We then printed the shape and the head of the aggregated DataFrame as a sanity check that the result is roughly what we expect. The aggregated DataFrame has the four original columns plus ```Region``` and ```Compliance```, and it has 76 rows, which is 19 rows for each of the four regions. This makes sense, because each additional region contributes its own data entries, so the sanity check passes.
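
One detail worth knowing: ```pd.concat()``` keeps the row labels of each input, so in the combined DataFrame every region's rows start again at 0. The sketch below, on two tiny stand-in frames with hypothetical names, shows the difference that the ```ignore_index=True``` option makes.

```python
import pandas as pd

# Two tiny frames standing in for two regional files
first_df = pd.DataFrame({'Region': ['apac', 'apac'], 'Shipped Tonnes': [76.1, 569.794]})
second_df = pd.DataFrame({'Region': ['la', 'la'], 'Shipped Tonnes': [4282.418072, 852.957793]})

kept_index = pd.concat([first_df, second_df], axis=0)
fresh_index = pd.concat([first_df, second_df], axis=0, ignore_index=True)

print(list(kept_index.index))   # [0, 1, 0, 1]
print(list(fresh_index.index))  # [0, 1, 2, 3]
```

Repeated labels are harmless for the filtering and grouping done in this case, but they matter for operations that look rows up by label.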

Now, if we want to reverse this process and extract the data relevant to a single region from the aggregated DataFrame ```agg_df```, we can do so using the ```==``` operator, which returns True when two objects contain the same value, and False otherwise:

In [25]:
region_LA_df = agg_df[agg_df['Region'] == 'la']
region_LA_df.head()

Unnamed: 0,Calendar Day,Commercial Forecast Tonnes,Shipped Quantity,Shipped Tonnes,Region,Compliance
0,2/1/2020,6122.58216,17156234.0,4282.418072,la,0.699446
1,2/2/2020,1768.405546,2872543.0,852.957793,la,0.482332
2,2/3/2020,8708.353228,22246211.0,5453.518169,la,0.62624
3,2/4/2020,8708.353228,27522634.0,7130.092596,la,0.818765
4,2/5/2020,8708.353228,30240505.0,7652.357294,la,0.878738


Looking at the code block above, we've kept only the rows that correspond to the Latin America region. Namely,

```python
agg_df['Region'] == 'la'
```
returns a boolean Series with the same number of rows as ```agg_df```, where each value is ```True``` or ```False``` depending on whether that row's ```Region``` value is equal to ```'la'```.

This row extraction technique will be useful to us later in this case when we perform analyses on each individual region.
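
Boolean Series like this can also be combined before filtering. The following sketch uses a small DataFrame (```toy_df```, with partly made-up values) to show two common patterns: combining conditions with ```&``` and testing membership with ```isin()```.

```python
import pandas as pd

# Small illustrative DataFrame; the 'na' value is made up
toy_df = pd.DataFrame({
    'Region': ['apac', 'la', 'la', 'na'],
    'Shipped Tonnes': [3030.398, 4282.418072, 852.957793, 6472.0],
})

# Each comparison produces a boolean Series with one value per row
is_la = toy_df['Region'] == 'la'
is_large = toy_df['Shipped Tonnes'] > 1000

# Combine masks with & (and) or | (or)
la_large_df = toy_df[is_la & is_large]

# isin() tests membership in several values at once
apac_or_na_df = toy_df[toy_df['Region'].isin(['apac', 'na'])]

print(la_large_df)
print(apac_or_na_df)
```

When the comparisons are written inline rather than stored in variables, each one needs its own parentheses, for example ```toy_df[(toy_df['Region'] == 'la') & (toy_df['Shipped Tonnes'] > 1000)]```.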

### Exercise 3:

Write a for loop that goes through each of the four regions, extracts only the rows corresponding to that region, and calculates and prints the average ```Shipped Quantity``` value for it.

**Answer.** One possible solution is indicated below:

In [26]:
# One possible solution
region_list = ['apac','eu','la','na']

for i in region_list:
    print(i)
    region_df = agg_df[agg_df['Region'] == i]
    region_avg_shippedquantity = region_df['Shipped Quantity'].mean()
    print(region_avg_shippedquantity)

apac
11125755.0
eu
43376028.94736842
la
25502723.736842107
na
23264635.606315788


## Analyzing each region's compliance levels

```pandas``` offers the ability to group related rows of a DataFrame according to the values in one or more of its columns. This useful feature is accomplished using the ```groupby()``` method. Let's take a look and see how this can be used to group rows so that each group corresponds to a single region:

In [27]:
# Use the groupby() method, notice a DataFrameGroupBy object is returned
agg_df.groupby('Region')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000025E19157CF8>

Here, the ```DataFrameGroupBy``` object can be most readily thought of as containing a DataFrame object for every group (in this case, a DataFrame object for each region). Specifically, each item of the object is a tuple containing the group identifier (in this case the Region) and the corresponding rows of the DataFrame that have that Region.

Fortunately, ```pandas``` allows you to iterate over the groupby object to see what's inside:

In [28]:
grp_obj = agg_df.groupby('Region') # Group data in agg_df by Region

# Loop through groups
for item in grp_obj:
    print(" ------ Loop Begins ------ ")
    print(type(item))     # Showing type of the item in grp_obj
    print(item[0])        # Region
    print(item[1].head()) # DataFrame with data for the Region
    print(" ------ Loop Ends ------ ")

 ------ Loop Begins ------ 
<class 'tuple'>
apac
  Calendar Day  Commercial Forecast Tonnes  Shipped Quantity  Shipped Tonnes  \
0     2/1/2020                    0.000000          502728.0          76.100   
1     2/2/2020                    0.000000         3151926.0         569.794   
2     2/3/2020                 3778.204283        12879518.0        3030.398   
3     2/4/2020                 3773.625591        17563183.0        3959.278   
4     2/5/2020                 3779.526197        16341120.0        3582.866   

  Region  Compliance  
0   apac         inf  
1   apac         inf  
2   apac    0.802074  
3   apac    1.049197  
4   apac    0.947967  
 ------ Loop Ends ------ 
 ------ Loop Begins ------ 
<class 'tuple'>
eu
  Calendar Day  Commercial Forecast Tonnes  Shipped Quantity  Shipped Tonnes  \
0     2/1/2020                     0.00000         2802197.0         790.578   
1     2/2/2020                     0.00000         1266092.0         356.352   
2     2/3/2020     

Let's combine the ```groupby()``` method with the ```describe()``` method and apply it to each region to analyze the distribution of compliance-related features for each region.

In [29]:
grp_obj = agg_df.groupby('Region') # Group data in agg_df by Region

# Loop through groups
for item in grp_obj:
    print('------Region: ', item[0])
    grp_df = item[1]
    relevant_df = grp_df[['Compliance']]
    print(relevant_df.describe())

------Region:  apac
       Compliance
count   19.000000
mean          inf
std           NaN
min      0.623686
25%      0.882408
50%      1.023103
75%           inf
max           inf
------Region:  eu
       Compliance
count   19.000000
mean          inf
std           NaN
min      0.000000
25%      0.996139
50%      1.037899
75%           inf
max           inf
------Region:  la
       Compliance
count   19.000000
mean     0.948708
std      0.409703
min      0.045944
25%      0.812527
50%      0.924755
75%      1.056716
max      2.006248
------Region:  na
       Compliance
count   19.000000
mean          inf
std           NaN
min      0.003251
25%      0.867212
50%      0.935507
75%           inf
max           inf


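One immediate observation is that daily compliance varies widely, and that in three regions (apac, eu and na) the mean, 75th percentile and maximum are infinite (```inf```). These infinite values come from days where ```Commercial Forecast Tonnes``` is 0 but tonnes were still shipped, because dividing by zero yields ```inf```. A probable explanation is that no sales were expected on those days.

Another observation is that the 25th percentile of compliance is above 80% in every region, so on roughly three quarters of the days each region shipped at least 80% of its forecast.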

While this is great to see, there is a more powerful way to display this data in pandas. We can call the ```describe()``` method directly on the ```DataFrameGroupBy``` object. This one line allows you to avoid having to write a for loop every time you'd like to summarize data:

In [30]:
# Compliance
agg_df[['Region','Compliance']].groupby('Region').describe()

Unnamed: 0_level_0,Compliance,Compliance,Compliance,Compliance,Compliance,Compliance,Compliance,Compliance
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max
Region,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2
apac,19.0,inf,,0.623686,0.882408,1.023103,inf,inf
eu,19.0,inf,,0.0,0.996139,1.037899,inf,inf
la,19.0,0.948708,0.409703,0.045944,0.812527,0.924755,1.056716,2.006248
na,19.0,inf,,0.003251,0.867212,0.935507,inf,inf


This data is identical to the data previously outputted using the for loop approach. The difference is that utilizing the features of the ```DataFrameGroupBy``` object allows for easy coding, fast results, and a clean output. This illustrates the power of the ```groupby()``` method: generating statistics for groups of interest in your data is straightforward and efficient to code.
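
When only a few statistics are of interest, the grouped object's ```agg()``` method accepts a list of statistic names. This is a minimal sketch on made-up compliance values (```toy_df``` and ```summary``` are hypothetical names), not the regional results.

```python
import pandas as pd

# Made-up compliance values for two regions
toy_df = pd.DataFrame({
    'Region': ['apac', 'apac', 'la', 'la'],
    'Compliance': [0.8, 1.0, 0.6, 0.9],
})

# One row per region, one column per requested statistic
summary = toy_df.groupby('Region')['Compliance'].agg(['count', 'mean', 'median'])
print(summary)
```

The result is an ordinary DataFrame indexed by ```Region```, so it can be filtered or sorted like any other.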

## Labelling data points as complying and not complying

Now that we've determined that the compliance levels of each region can vary widely from day to day, the next logical step is to label each day as high or low compliance so that days with low compliance can be identified.

However, we don't currently have a column that identifies when compliance is high and when it is low. Therefore, we must create a new column called ```Status``` using a threshold. For example, we'd like to have a new column value determined by:

```

if Compliance > threshold:
    Status= 'High Compliance'
else:
    Status = 'Low Compliance'
```

Here we will define low compliance as any ```Compliance``` value below the 50th percentile (the median) for the region. High compliance is anything at or above the 50th percentile.
Let's take a look at how we can accomplish this task using ```groupby()``` functionality and the ```quantile()``` method, which returns the value at a given percentile for a series of data:

In [31]:
# Determine the compliance threshold for each region
status_thresholds = agg_df.groupby('Region')['Compliance'].quantile(0.5) # 50th percentile (median)
print(status_thresholds)

Region
apac    1.023103
eu      1.037899
la      0.924755
na      0.935507
Name: Compliance, dtype: float64


Since we'd like to label days of high and low compliance by region, we will make use of the ```np.where()``` function in the ```numpy``` library. This function takes a logical condition as its first argument: where the condition is true, it returns its second argument, and where the condition is false, it returns its third argument. This is very similar to how Microsoft Excel's ```IF()``` function works (helpful to think of it this way for those familiar with Excel). Let's loop through each region and label each day as either high or low compliance:

In [32]:
# Loop through regions
print("Defining regions")
list_of_regions= ['apac','eu','la','na']
list_of_df = []

# Loop over all regions
print(" --- Loop over regions --- ")
for i in list_of_regions:
    print("Labelling compliance for region: " + i)
    temp_df = agg_df[agg_df['Region'] == i].copy() # make a copy of the dataframe to ensure not affecting agg_df
    region_threshold = status_thresholds.loc[i] # median compliance for this region

    temp_df['Status'] = np.where(temp_df['Compliance'] < region_threshold, 'Low Compliance', 'High Compliance') # Compliance label
    list_of_df.append(temp_df)
    
print(" --- Completed loop over regions --- ")

print("Aggregating data")
labelled_df = pd.concat(list_of_df)

Defining regions
 --- Loop over regions --- 
Labelling compliance for region: apac
Labelling compliance for region: eu
Labelling compliance for region: la
Labelling compliance for region: na
 --- Completed loop over regions --- 
Aggregating data


In [33]:
labelled_df.head()

Unnamed: 0,Calendar Day,Commercial Forecast Tonnes,Shipped Quantity,Shipped Tonnes,Region,Compliance,Status
0,2/1/2020,0.0,502728.0,76.1,apac,inf,High Compliance
1,2/2/2020,0.0,3151926.0,569.794,apac,inf,High Compliance
2,2/3/2020,3778.204283,12879518.0,3030.398,apac,0.802074,Low Compliance
3,2/4/2020,3773.625591,17563183.0,3959.278,apac,1.049197,High Compliance
4,2/5/2020,3779.526197,16341120.0,3582.866,apac,0.947967,Low Compliance


We've now added a ```Status``` column that identifies whether each day in each Region is a day of high or low compliance.
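
The same labelling can be sketched without an explicit loop. The example below is an alternative, not the approach used above: it relies on the grouped object's ```transform()``` method, which returns one value per row (here, the median of that row's region). The data is made up so the result can be checked by eye.

```python
import numpy as np
import pandas as pd

# Made-up compliance values for two regions
toy_df = pd.DataFrame({
    'Region': ['apac', 'apac', 'apac', 'la', 'la', 'la'],
    'Compliance': [0.8, 1.0, 1.2, 0.5, 0.7, 0.9],
})

# transform() returns one value per row: the median of that row's region
region_median = toy_df.groupby('Region')['Compliance'].transform('median')

# Same rule as the loop: below the regional median is low, otherwise high
toy_df['Status'] = np.where(toy_df['Compliance'] < region_median,
                            'Low Compliance', 'High Compliance')
print(toy_df)
```

Because ```region_median``` is aligned row by row with ```toy_df```, no per-region copy or ```pd.concat()``` step is needed.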

### Exercise 4:

Write code to label each day as Low, Medium, or High, where:

```
if Compliance > (75th percentile compliance for given region):
    Status = 'HIGH'
elif  Compliance > (25th percentile compliance for given region):
    Status = 'MEDIUM'
else:
    Status = 'LOW'
```

Create a ```final_df``` DataFrame with the new labels, then group it by Region and Status and print the mean ```Shipped Tonnes``` for each Status category.

**Answer.** One possible solution is shown below:

In [34]:
# One possible solution
# Determine thresholds for compliance for each region
compliance_thresholds_75 = agg_df.groupby('Region')['Compliance'].quantile(0.75) # 75th percentile 
compliance_thresholds_25 = agg_df.groupby('Region')['Compliance'].quantile(0.25) # 25th percentile 

# Loop through regions
print("Defining Regions")
list_of_Regions= ['apac','eu','la','na']
list_of_df = []

# Loop over all regions
print(" --- Loop over Regions --- ")
for i in list_of_Regions:
    print("Labelling compliance for region: " + i)
    temp_df = agg_df[agg_df['Region'] == i].copy() # make a copy of the dataframe to ensure not affecting agg_df
    compliance_t75 = compliance_thresholds_75.loc[i]
    compliance_t25 = compliance_thresholds_25.loc[i]
    
    temp_df['Status'] = np.where(temp_df['Compliance'] > compliance_t75, 'HIGH',
                                  np.where(temp_df['Compliance'] > compliance_t25, 'MEDIUM','LOW')) # compliance label
    list_of_df.append(temp_df)
    
print(" --- Completed loop over Regions --- ")

print("Aggregating data")
final_df = pd.concat(list_of_df)
print(final_df.groupby(['Region','Status'])[['Shipped Tonnes']].mean())

Defining Regions
 --- Loop over Regions --- 
Labelling compliance for region: apac
Labelling compliance for region: eu
Labelling compliance for region: la
Labelling compliance for region: na
 --- Completed loop over Regions --- 
Aggregating data
               Shipped Tonnes
Region Status                
apac   LOW        2773.097800
       MEDIUM     2483.388157
eu     LOW       14762.047600
       MEDIUM    11840.703929
la     HIGH       7032.844024
       LOW        3602.087943
       MEDIUM     7531.170791
na     LOW        6472.032093
       MEDIUM     5987.632267
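
Notice that the ```HIGH``` category is missing for apac, eu and na. Their 75th percentile of ```Compliance``` was reported earlier as ```inf```, and no value is strictly greater than infinity, so no day can pass the ```HIGH``` test. The sketch below reproduces this on made-up values and shows one possible workaround, computing the threshold from the finite values only. That workaround is an assumption about how zero-forecast days should be treated, not part of the solution above.

```python
import numpy as np
import pandas as pd

# Made-up compliance values; inf marks days with a zero forecast
compliance = pd.Series([0.8, 0.9, 1.0, 1.1, np.inf, np.inf])

# With a threshold of inf, as reported for apac, eu and na, nothing is strictly greater
threshold_75 = np.inf
print((compliance > threshold_75).any())  # False

# One option: compute the threshold from the finite values only
finite = compliance.replace([np.inf, -np.inf], np.nan)
finite_threshold_75 = finite.quantile(0.75)  # NaN values are ignored
print(finite_threshold_75)
print((finite > finite_threshold_75).sum())  # number of days labelled HIGH
```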


### Exercise 5:

Write a script to find and print the day that has the highest shipped tonnes for each Region. 

**Answer.** One possible solution is shown below:

In [35]:
# One possible solution

# Add a column for the day of the month (dates are formatted month/day/year)
day_list = []
for i in agg_df['Calendar Day']:
    day_list.append(i.split('/')[1]) # works for one- and two-digit days
    
agg_df['day'] = day_list

# Group by Region, then loop through the group object to group by day and calculate shipped tonnes
grp = agg_df.groupby('Region')
for item in grp:
    print('------Region: ', item[0])
    grp_df = item[1]
    relevant_df = grp_df[['day','Shipped Tonnes']]
    day_df = relevant_df.groupby('day').sum()
    
    # Largest daily total, then the row(s) of day_df that match it
    max_volume = day_df['Shipped Tonnes'].max()
    print(day_df[day_df['Shipped Tonnes'] == max_volume])

------Region:  apac
     Shipped Tonnes
day                
18         4739.592
------Region:  eu
     Shipped Tonnes
day                
11        20810.462
------Region:  la
     Shipped Tonnes
day                
12       9681.34499
------Region:  na
     Shipped Tonnes
day                
10      11050.57511
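
A more compact alternative is sketched below using ```idxmax()```, which returns the row label of the largest value in each group. It is shown on a few made-up rows (```toy_df``` is a hypothetical name). Because ```idxmax()``` returns row labels, the labels must be unique for ```.loc``` to pick a single row per region, so the index is rebuilt first.

```python
import pandas as pd

# A few illustrative rows for two regions
toy_df = pd.DataFrame({
    'Calendar Day': ['2/1/2020', '2/2/2020', '2/1/2020', '2/2/2020'],
    'Region': ['apac', 'apac', 'la', 'la'],
    'Shipped Tonnes': [76.1, 569.794, 4282.418072, 852.957793],
})

# Make the row labels unique (needed after concatenating regional files)
toy_df = toy_df.reset_index(drop=True)

# Row label of the largest Shipped Tonnes value within each region
top_rows = toy_df.loc[toy_df.groupby('Region')['Shipped Tonnes'].idxmax()]
print(top_rows[['Region', 'Calendar Day', 'Shipped Tonnes']])
```

This version assumes one row per region per day, as in the data used here; with several rows per day the daily totals would need to be summed first, as in the solution above.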


## Takeaways

In this case, we've learned the foundations of the ```pandas``` library in Python. We now know how to:

1. Read data from files
2. Aggregate and manipulate data using ```pandas```
3. Analyze summary statistics and gather information from trends across time

Going forward, we will be able to use ```pandas``` as a data analysis framework to build more complex projects and solve critical business problems.

### Extended read

#### Pandas Cheat Sheet

https://owensillinois.sharepoint.com/:b:/r/teams/DS4OI-One/Shared%20Documents/General/Resources/Data%20Wrangling%20with%20pandas%20.pdf?csf=1&e=s2fgrh

#### Pandas practice


https://www.w3resource.com/python-exercises/pandas/index.php

#### Pandas merging 101


https://stackoverflow.com/questions/53645882/pandas-merging-101