# How are the historic sales of O-I?

Import the necessary packages


In [1]:
import pandas as pd  # to work with pandas DataFrames
import numpy as np
import datetime  # library to work with and transform dates

# Introduction

**Business Problem**: Your task is to format the given data and provide insights into the historic sales of O-I since 2015.

**Analytical Context**: You are given a CSV file containing details about the sales by global region, region, country, regional sales parent, and color, month by month from 2015 to 2020. The delimiter in the given CSV file is `;` instead of the default `,`. You will be performing the following tasks on the data:

1. Read, transform, and prepare data to answer questions asked by the business leaders
2. Perform analytics of the data to identify patterns in the dataset

The client has a specific set of questions they would like to get answers to:

1. What is the discount %, if any, given on each sale? Add a column with the calculation.
2. What is the greatest discount ever given, to what company, and when? Which company has received the fewest discount offers? If tied, pick the one with the longest history with O-I.
3. What is the average of all numeric values for each country by year?
4. Which companies buy the most tonnes, and which ones buy the least? Classify them as `HOT` and `COLD`.
5. What are the most sold colors among the top 10 companies? Filter out `NOT ASSIGNED`.
6. In which month do sales peak?
7. What is the average price per TO for each company? Which one has the highest price per TO? If it's `NOT ASSIGNED`, take the next `Company Name`.





### Fetch information from the CSV file

Read the file (remember, the delimiter is `;`) and go through the columns present in the DataFrame.

In [2]:
raw_df = pd.read_csv('Global Sales.csv', sep=';')
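
As a minimal sketch of what `sep=';'` does, the snippet below reads a tiny in-memory sample instead of `Global Sales.csv`. The two rows and the reduced set of columns are hypothetical stand-ins for the real file.

```python
import io
import pandas as pd

# Hypothetical two-row sample that mimics the semicolon-delimited layout
sample_csv = io.StringIO(
    "Region;Country;Sales (USD);Calendar Month/Year\n"
    "EU;Czech Republic;719463.38;201512\n"
    "EU;Czech Republic;5339.4;201501\n"
)

# sep=';' tells pandas to split fields on semicolons instead of commas
sample_df = pd.read_csv(sample_csv, sep=';')
print(sample_df.columns.tolist())
print(sample_df.shape)
```

Without `sep=';'`, pandas would split on commas and each line would not be separated into the four intended fields.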

In [3]:
raw_df.shape

(204867, 13)

In [4]:
raw_df.groupby('Region').count()

Region,Unnamed: 0,Color,Global Region,Product Category,Country,Sub-region,Sales (USD),Negotiated Discount,Quantity in TO,Calendar Month/Year,Company,","
APAC,27458,17786,27447,27458,27458,27447,27243,27458,26737,27458,27458,0
APAC-JV,922,0,922,922,922,922,922,922,922,922,922,0
EU,80512,69472,80512,80512,80512,80512,80397,80512,80110,80512,80512,0
IVC,71,0,71,71,71,71,71,71,71,71,71,0
Latin America,59824,36787,59823,59824,59824,59823,59821,59824,59265,59824,59824,0
North America,35806,28631,35806,35806,35806,35806,35408,35806,30962,35806,35806,0


### Format the date and drop unnecessary columns in the dataframe

To format the date use datetime and a lambda function.

## Lambda/Anonymous functions

In Python, a lambda or anonymous function is a function that is defined without a name. While normal functions are defined using the `def` keyword, in Python anonymous functions are defined using the `lambda` keyword. Hence, anonymous functions are also called lambda functions.

A lambda function in Python has the following syntax:

`lambda arguments: expression`

Lambda functions can have any number of arguments but only one expression. The expression is evaluated and returned. Lambda functions can be used wherever function objects are required.
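
A short sketch of the syntax in practice: a named function next to its lambda equivalent, and a lambda passed directly as the `key` argument of `sorted`. The names `month_of` and `month_of_lambda` and the sample codes are made up for illustration.

```python
# A regular function that extracts the month from a YYYYMM string
def month_of(code):
    return int(code[4:])

# The same logic written as a lambda and bound to a name
month_of_lambda = lambda code: int(code[4:])

# Lambdas are most useful when passed directly where a function is expected
codes = ["201512", "201501", "201508"]
sorted_by_month = sorted(codes, key=lambda code: int(code[4:]))

print(month_of("201512"), month_of_lambda("201512"))
print(sorted_by_month)
```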

## Datetime

In [5]:
# Test: parse a single YYYYMM string into a datetime object
stringdate = "201512"
dt = datetime.datetime.strptime(stringdate, "%Y%m")
print("parsed datetime: %s " % dt)

parsed datetime: 2015-12-01 00:00:00 


Python has a module named `datetime` for working with dates and times. The most commonly used classes in the `datetime` module are:

- `date` class
- `time` class
- `datetime` class
- `timedelta` class

This library allows the manipulation of dates in various ways. In this case we are going to convert the `Calendar Month/Year` column into datetime values. The test above parses a single string with `strptime()`; for the whole column, a lambda function applies `pd.to_datetime` with the same `%Y%m` format to each row and stores the result in another column called `dt`.

In [7]:
raw_df['dt'] = raw_df.apply(lambda row: pd.to_datetime(row['Calendar Month/Year'], format='%Y%m'), axis=1)
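
The row-by-row `apply` works, but on roughly 200,000 rows it can be slow. A possible alternative, sketched here on a hypothetical three-value sample, is to pass the whole column to `pd.to_datetime` at once. The sample assumes the codes were read as integers, so they are cast to strings first; the `strptime` loop is only there to check that both approaches agree.

```python
import datetime
import pandas as pd

# Hypothetical sample of YYYYMM codes, assumed to be stored as integers
codes = pd.Series([201512, 201501, 201508], name='Calendar Month/Year')

# Vectorized conversion: one call for the whole column
dt = pd.to_datetime(codes.astype(str), format='%Y%m')

# Element-by-element conversion with strptime, for comparison
dt_check = [datetime.datetime.strptime(str(code), '%Y%m') for code in codes]

print(dt)
print(list(dt) == [pd.Timestamp(d) for d in dt_check])
```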

Print the head of the DataFrame to verify the column is created.

In [6]:
raw_df.head()

Unnamed: 0.1,Unnamed: 0,Color,Global Region,Product Category,Region,Country,Sub-region,Sales (USD),Negotiated Discount,Quantity in TO,Calendar Month/Year,Company,","
0,0,Flint,EU,Food,EU,Czech Republic,ECE,719463.38,57557.07,1544.32,201512,"NOT ASSIGNED,",
1,1,Flint,EU,Nab,EU,Czech Republic,ECE,5339.4,640.73,9.95,201501,"NOT ASSIGNED,",
2,2,Flint,EU,Nab,EU,Czech Republic,ECE,30341.28,1213.65,61.25,201508,"NOT ASSIGNED,",
3,3,Flint,EU,Nab,EU,Czech Republic,ECE,8814.72,705.18,17.04,201502,"NOT ASSIGNED,",
4,4,Flint,EU,Wine,EU,Czech Republic,ECE,70702.15,9898.3,156.62,201502,"NOT ASSIGNED,",


Now create a column with the year of each sale.

In [8]:
raw_df['year'] = raw_df.apply(lambda row: row['dt'].year, axis=1)

In [9]:
raw_df['year']

0         2015
1         2015
2         2015
3         2015
4         2015
          ... 
204862    2019
204863    2019
204864    2019
204865    2019
204866    2018
Name: year, Length: 204867, dtype: int64

Now create a column with the month of each sale.

In [10]:
raw_df['month'] = raw_df.apply(lambda row: row['dt'].month , axis=1)

In [11]:
raw_df['month']

0         12
1          1
2          8
3          2
4          2
          ..
204862     6
204863     7
204864    11
204865    10
204866    12
Name: month, Length: 204867, dtype: int64
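
Once a column holds datetime values, the `.dt` accessor exposes the year and month without a lambda. This is a sketch on a hypothetical three-row frame, not a replacement for the cells above.

```python
import pandas as pd

# Hypothetical sample with an already-parsed datetime column
sample = pd.DataFrame(
    {'dt': pd.to_datetime(['2015-12-01', '2015-01-01', '2019-06-01'])}
)

# .dt gives vectorized access to the components of each date
sample['year'] = sample['dt'].dt.year
sample['month'] = sample['dt'].dt.month
print(sample)
```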

Finally, drop unnecessary columns.

In [12]:
raw_df.drop(['Calendar Month/Year','Sub-region','Global Region',','], axis=1)

Unnamed: 0.1,Unnamed: 0,Color,Product Category,Region,Country,Sales (USD),Negotiated Discount,Quantity in TO,Company,dt,year,month
0,0,Flint,Food,EU,Czech Republic,719463.38,57557.07,1544.32,"NOT ASSIGNED,",2015-12-01,2015,12
1,1,Flint,Nab,EU,Czech Republic,5339.40,640.73,9.95,"NOT ASSIGNED,",2015-01-01,2015,1
2,2,Flint,Nab,EU,Czech Republic,30341.28,1213.65,61.25,"NOT ASSIGNED,",2015-08-01,2015,8
3,3,Flint,Nab,EU,Czech Republic,8814.72,705.18,17.04,"NOT ASSIGNED,",2015-02-01,2015,2
4,4,Flint,Wine,EU,Czech Republic,70702.15,9898.30,156.62,"NOT ASSIGNED,",2015-02-01,2015,2
...,...,...,...,...,...,...,...,...,...,...,...,...
204862,204862,Flint,Nab,Latin America,Mexico,953629.13,114435.50,1996.77,"NOT ASSIGNED,",2019-06-01,2019,6
204863,204863,Flint,Nab,Latin America,Mexico,833829.93,83382.99,1741.27,"NOT ASSIGNED,",2019-07-01,2019,7
204864,204864,Flint,Nab,Latin America,Mexico,559864.33,61585.08,1208.83,"NOT ASSIGNED,",2019-11-01,2019,11
204865,204865,Flint,Nab,Latin America,Mexico,709759.43,21292.78,1515.25,"NOT ASSIGNED,",2019-10-01,2019,10


In [13]:
clean_df  = raw_df.drop(['Calendar Month/Year','Sub-region','Global Region',','], axis=1)

In [14]:
clean_df

Unnamed: 0.1,Unnamed: 0,Color,Product Category,Region,Country,Sales (USD),Negotiated Discount,Quantity in TO,Company,dt,year,month
0,0,Flint,Food,EU,Czech Republic,719463.38,57557.07,1544.32,"NOT ASSIGNED,",2015-12-01,2015,12
1,1,Flint,Nab,EU,Czech Republic,5339.40,640.73,9.95,"NOT ASSIGNED,",2015-01-01,2015,1
2,2,Flint,Nab,EU,Czech Republic,30341.28,1213.65,61.25,"NOT ASSIGNED,",2015-08-01,2015,8
3,3,Flint,Nab,EU,Czech Republic,8814.72,705.18,17.04,"NOT ASSIGNED,",2015-02-01,2015,2
4,4,Flint,Wine,EU,Czech Republic,70702.15,9898.30,156.62,"NOT ASSIGNED,",2015-02-01,2015,2
...,...,...,...,...,...,...,...,...,...,...,...,...
204862,204862,Flint,Nab,Latin America,Mexico,953629.13,114435.50,1996.77,"NOT ASSIGNED,",2019-06-01,2019,6
204863,204863,Flint,Nab,Latin America,Mexico,833829.93,83382.99,1741.27,"NOT ASSIGNED,",2019-07-01,2019,7
204864,204864,Flint,Nab,Latin America,Mexico,559864.33,61585.08,1208.83,"NOT ASSIGNED,",2019-11-01,2019,11
204865,204865,Flint,Nab,Latin America,Mexico,709759.43,21292.78,1515.25,"NOT ASSIGNED,",2019-10-01,2019,10


### What is the discount %, if any, given on each sale? Add a column with the calculation.

Add the column and show the first 20 rows.

In [15]:
clean_df['% discount'] = clean_df.apply(lambda row: row['Negotiated Discount']/row['Sales (USD)']*100 if row['Sales (USD)'] else 0, axis=1)

Print the first 20 rows

In [16]:
clean_df[0:20]

Unnamed: 0.1,Unnamed: 0,Color,Product Category,Region,Country,Sales (USD),Negotiated Discount,Quantity in TO,Company,dt,year,month,% discount
0,0,Flint,Food,EU,Czech Republic,719463.38,57557.07,1544.32,"NOT ASSIGNED,",2015-12-01,2015,12,8.0
1,1,Flint,Nab,EU,Czech Republic,5339.4,640.73,9.95,"NOT ASSIGNED,",2015-01-01,2015,1,12.000037
2,2,Flint,Nab,EU,Czech Republic,30341.28,1213.65,61.25,"NOT ASSIGNED,",2015-08-01,2015,8,3.999996
3,3,Flint,Nab,EU,Czech Republic,8814.72,705.18,17.04,"NOT ASSIGNED,",2015-02-01,2015,2,8.000027
4,4,Flint,Wine,EU,Czech Republic,70702.15,9898.3,156.62,"NOT ASSIGNED,",2015-02-01,2015,2,13.999999
5,5,Flint,Wine,EU,Czech Republic,124935.62,9994.85,259.84,"NOT ASSIGNED,",2015-03-01,2015,3,8.0
6,6,Flint,Nab,EU,Czech Republic,28471.65,854.15,58.09,"NOT ASSIGNED,",2015-07-01,2015,7,3.000002
7,7,Flint,Drug & chemical,EU,Czech Republic,17006.82,1190.48,11.06,"NOT ASSIGNED,",2015-09-01,2015,9,7.000015
8,8,Flint,Spirits,EU,Czech Republic,173605.15,15624.46,380.79,"NOT ASSIGNED,",2015-05-01,2015,5,8.999998
9,9,Flint,Wine,EU,Czech Republic,115043.96,0.0,228.6,"NOT ASSIGNED,",2015-12-01,2015,12,0.0
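
The same percentage can be computed on whole columns instead of row by row. The sketch below uses a hypothetical four-row sample and keeps the behavior of the lambda: a sale of 0 USD gets a 0 % discount, and a missing sale amount stays missing.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: a normal sale, a zero sale, a sale without discount,
# and a row with a missing sale amount
sample = pd.DataFrame({
    'Sales (USD)': [719463.38, 0.0, 115043.96, np.nan],
    'Negotiated Discount': [57557.07, 0.0, 0.0, 10.0],
})

# Column-wise division; dividing by zero yields NaN or inf, not an error
ratio = sample['Negotiated Discount'] / sample['Sales (USD)'] * 100

# Keep the ratio where sales are non-zero, otherwise use 0
sample['% discount'] = ratio.where(sample['Sales (USD)'] != 0, 0)
print(sample)
```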


### What is the greatest discount ever given, to what company and when? Which company has gotten the lowest discount?

In [17]:
clean_df['% discount'].describe()

count    204136.000000
mean          6.854534
std           4.887646
min           0.000000
25%           2.000007
50%           6.999999
75%          11.000000
max          16.666667
Name: % discount, dtype: float64

Print information about the greatest discount.

In [18]:
max_discount = clean_df['% discount'].max()
df_max_discount = clean_df[clean_df['% discount'] == max_discount]
df_max_discount

Unnamed: 0.1,Unnamed: 0,Color,Product Category,Region,Country,Sales (USD),Negotiated Discount,Quantity in TO,Company,dt,year,month,% discount
82656,82656,,Beer,North America,Canada,0.12,0.02,0.0,"TRICORBRAUN,",2018-01-01,2018,1,16.666667
175758,175758,,Wine,EU,Estonia,0.12,0.02,0.45,"NOT ASSIGNED,",2019-03-01,2019,3,16.666667


Print information about the lowest discount.

In [19]:
min_discount = clean_df['% discount'].min()
df_min_discount = clean_df[clean_df['% discount'] == min_discount]
df_min_discount

Unnamed: 0.1,Unnamed: 0,Color,Product Category,Region,Country,Sales (USD),Negotiated Discount,Quantity in TO,Company,dt,year,month,% discount
9,9,Flint,Wine,EU,Czech Republic,115043.96,0.0,228.60,"NOT ASSIGNED,",2015-12-01,2015,12,0.0
14,14,Flint,Miscellaneous,EU,Czech Republic,90822.42,0.0,109.45,"NOT ASSIGNED,",2015-09-01,2015,9,0.0
23,23,Flint,Miscellaneous,EU,Czech Republic,36908.64,0.0,48.23,"NOT ASSIGNED,",2015-01-01,2015,1,0.0
65,65,Flint,Nab,EU,Czech Republic,18473.37,0.0,34.78,"NOT ASSIGNED,",2015-10-01,2015,10,0.0
82,82,Flint,Miscellaneous,EU,Czech Republic,195431.38,0.0,121.03,"NOT ASSIGNED,",2015-10-01,2015,10,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
204820,204820,Flint,Beer,Latin America,Mexico,0.00,0.0,0.15,"NOT ASSIGNED,",2018-07-01,2018,7,0.0
204831,204831,Flint,Spirits,Latin America,Mexico,1962873.15,0.0,3170.71,"NOT ASSIGNED,",2016-06-01,2016,6,0.0
204837,204837,Flint,Food,Latin America,Mexico,2722056.94,0.0,4423.38,"NOT ASSIGNED,",2018-02-01,2018,2,0.0
204839,204839,Flint,Food,Latin America,Mexico,1802388.38,0.0,3153.11,"NOT ASSIGNED,",2015-12-01,2015,12,0.0
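
The rows above are individual sales, while the question asks about companies and includes a tie-break. One possible reading, sketched on hypothetical data, is to rank companies by their mean discount and break ties with the earliest sale date, taken here as a proxy for "more history with O-I". The column names `mean_discount` and `first_sale` are illustrative.

```python
import pandas as pd

# Hypothetical sample: A and B are tied at 0 %, A has the older first sale
sample = pd.DataFrame({
    'Company': ['A', 'A', 'B', 'C'],
    '% discount': [0.0, 0.0, 0.0, 5.0],
    'dt': pd.to_datetime(['2015-01-01', '2016-03-01', '2017-05-01', '2015-02-01']),
})

# One row per company: average discount and date of the first recorded sale
per_company = sample.groupby('Company').agg(
    mean_discount=('% discount', 'mean'),
    first_sale=('dt', 'min'),
)

# Lowest discount first; among ties, the longest history first
ranked = per_company.sort_values(['mean_discount', 'first_sale'])
least_discounted = ranked.index[0]
print(ranked)
print(least_discounted)
```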


### What is the sales average for each country by year?

In [20]:
clean_df.groupby(['Country','year'])[['Sales (USD)']].mean()

Country,year,Sales (USD)
Argentina,2015,5.353886e+04
Argentina,2016,7.119739e+04
Argentina,2017,1.391253e+05
Argentina,2018,6.302183e+05
Argentina,2019,3.240988e+05
...,...,...
Vietnam,2016,4.006964e+05
Vietnam,2017,4.147360e+05
Vietnam,2018,3.304969e+06
Vietnam,2019,1.801431e+06
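
The cell above averages only `Sales (USD)`, whereas the client's question mentions all numeric values. A sketch of how the other numeric columns could be included, on a hypothetical three-row sample with an explicit column list:

```python
import pandas as pd

# Hypothetical sample with the three numeric columns and one text column
sample = pd.DataFrame({
    'Country': ['Mexico', 'Mexico', 'Italy'],
    'year': [2019, 2019, 2021],
    'Sales (USD)': [100.0, 300.0, 50.0],
    'Negotiated Discount': [10.0, 30.0, 0.0],
    'Quantity in TO': [1.0, 3.0, 2.0],
    'Color': ['Flint', 'Flint', 'Other'],
})

# Listing the columns explicitly keeps text columns out of the average
numeric_cols = ['Sales (USD)', 'Negotiated Discount', 'Quantity in TO']
yearly_means = sample.groupby(['Country', 'year'])[numeric_cols].mean()
print(yearly_means)
```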


### Descriptive analytics:
When you want to get fast insights, it's useful to see the statistical description of the numeric columns available in the dataset. We can use the function `describe()` to get this information.

Use `describe()` in a `for` loop to get the statistical description of `Quantity in TO` for each region.

In [21]:
grp_obj = clean_df.groupby('Region')

# Loop through the regions; each item is a (region name, sub-DataFrame) pair
for region, grp_df in grp_obj:
    print('------Region: ', region)
    relevant_df = grp_df[['Quantity in TO']]
    # Summary statistics of the tonnes sold in this region
    stats_df = relevant_df.describe()
    print(stats_df)

------Region:  APAC
       Quantity in TO
count    26737.000000
mean      1299.486700
std       8093.941751
min       -843.200000
25%         11.470000
50%         82.430000
75%        418.000000
max     275637.080000
------Region:  APAC-JV
       Quantity in TO
count      922.000000
mean      5272.022137
std      12033.002317
min          0.000000
25%         60.930000
50%        541.840000
75%       3959.197500
max      80296.730000
------Region:  EU
       Quantity in TO
count    80110.000000
mean      3105.639167
std      11136.545463
min      -2781.000000
25%         70.830000
50%        404.325000
75%       1782.547500
max     362488.180000
------Region:  IVC
       Quantity in TO
count       71.000000
mean     57182.550000
std      56811.794372
min       8891.610000
25%      15127.250000
50%      43349.130000
75%      60284.390000
max     195130.610000
------Region:  Latin America
       Quantity in TO
count    5.926500e+04
mean     1.926197e+03
std      1.414426e+04
min     -1.
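
The same statistics can also be collected in a single table by calling `describe()` on the grouped column, which avoids the explicit loop. A sketch on a hypothetical three-row sample:

```python
import pandas as pd

# Hypothetical sample with two regions
sample = pd.DataFrame({
    'Region': ['EU', 'EU', 'APAC'],
    'Quantity in TO': [10.0, 30.0, 5.0],
})

# One row per region, one column per statistic (count, mean, std, ...)
region_stats = sample.groupby('Region')['Quantity in TO'].describe()
print(region_stats)
```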

### Which companies buy the most tonnes, and which ones buy the least?
Classify them as `HOT` if their total tonnes exceed the median of the company totals and as `COLD` otherwise.

In [22]:
clean_df.shape

(204867, 13)

In [23]:
print(clean_df[clean_df['Company'] == 'NOT ASSIGNED,'].shape)
print(clean_df[(clean_df['Company'] ==  'NOT ASSIGNED,') | (clean_df['Company'] ==  ',')].shape)

(53378, 13)
(53391, 13)
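
The company names carry a trailing comma (`'NOT ASSIGNED,'`, and `','` for empty names), which is why the filters above compare against those exact strings. A possible cleanup, sketched on a hypothetical sample, is to strip the comma once so later filters can use the plain names:

```python
import pandas as pd

# Hypothetical sample reproducing the trailing-comma pattern
sample = pd.DataFrame({'Company': ['NOT ASSIGNED,', ',', 'HEINEKEN,']})

# Remove trailing commas; an entry that was only ',' becomes an empty string
sample['Company'] = sample['Company'].str.rstrip(',')

# Keep only rows with a real company name
assigned = sample[~sample['Company'].isin(['NOT ASSIGNED', ''])]
print(assigned)
```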


In [24]:
print(clean_df[clean_df['Company'] == 'NOT ASSIGNED,'].shape)
assign_cp_df = clean_df[(clean_df['Company'] != 'NOT ASSIGNED,') & (clean_df['Company'] !=  ',')]
assign_cp_df.shape
print("Individual Sales median :", f"{assign_cp_df['Quantity in TO'].median():,}")

group_list_df = assign_cp_df.groupby('Company')['Quantity in TO']
sum_group_list_df = assign_cp_df.groupby('Company')['Quantity in TO'].sum()
type(group_list_df)

(53378, 13)
Individual Sales median : 221.4


pandas.core.groupby.generic.SeriesGroupBy

In [25]:

# Median of the total tonnes bought per company
cp_sales_median = sum_group_list_df.median()
list_cp_df = []
print("Companies sales median: ", f"{cp_sales_median:,}")

# Each item is a (company name, Series of that company's quantities) pair
for company, quantities in group_list_df:
    temp_assign_df = assign_cp_df[assign_cp_df['Company'] == company].copy()
    total_tonnes = quantities.sum()
    # HOT if the company's total is above the median, COLD otherwise
    qualification = "HOT" if total_tonnes > cp_sales_median else "COLD"
    temp_assign_df['Sales Company'] = total_tonnes
    temp_assign_df['Sales Score'] = qualification
    list_cp_df.append(temp_assign_df)

qualified_cp_df = pd.concat(list_cp_df)
qualified_cp_df.head()

Companies sales median:  22,048.350000000006


Unnamed: 0.1,Unnamed: 0,Color,Product Category,Region,Country,Sales (USD),Negotiated Discount,Quantity in TO,Company,dt,year,month,% discount,Sales Company,Sales Score
35333,35333,,NOT ASSIGNED,North America,USA,340.18,37.42,0.15,"(*CINCINNATI CONTAINER COMPANY*),",2018-10-01,2018,10,11.000059,32032.06,HOT
42231,42231,Amber,Drug & chemical,North America,USA,348759.82,24413.19,249.63,"(*CINCINNATI CONTAINER COMPANY*),",2018-01-01,2018,1,7.000001,32032.06,HOT
42232,42232,Amber,Drug & chemical,North America,USA,96745.18,14511.78,74.57,"(*CINCINNATI CONTAINER COMPANY*),",2015-04-01,2015,4,15.000003,32032.06,HOT
42233,42233,Amber,Drug & chemical,North America,USA,275613.91,27561.39,199.36,"(*CINCINNATI CONTAINER COMPANY*),",2017-08-01,2017,8,10.0,32032.06,HOT
42234,42234,Amber,Drug & chemical,North America,USA,165896.61,9953.8,120.04,"(*CINCINNATI CONTAINER COMPANY*),",2017-06-01,2017,6,6.000002,32032.06,HOT
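
The loop builds one sub-DataFrame per company and concatenates them. A sketch of an alternative that should give the same two columns without a loop uses `groupby(...).transform('sum')` to broadcast each company's total back onto its rows. The four-row sample is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: company totals are A = 150, B = 20, C = 5
sample = pd.DataFrame({
    'Company': ['A', 'A', 'B', 'C'],
    'Quantity in TO': [100.0, 50.0, 20.0, 5.0],
})

# Median of the per-company totals (20 for this sample)
company_totals = sample.groupby('Company')['Quantity in TO'].sum()
median_total = company_totals.median()

# transform('sum') repeats each company's total on every one of its rows
sample['Sales Company'] = sample.groupby('Company')['Quantity in TO'].transform('sum')

# Strictly above the median is HOT, everything else is COLD
sample['Sales Score'] = np.where(sample['Sales Company'] > median_total, 'HOT', 'COLD')
print(sample)
```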


In [26]:
qualified_cp_df

Unnamed: 0.1,Unnamed: 0,Color,Product Category,Region,Country,Sales (USD),Negotiated Discount,Quantity in TO,Company,dt,year,month,% discount,Sales Company,Sales Score
35333,35333,,NOT ASSIGNED,North America,USA,340.18,37.42,0.15,"(*CINCINNATI CONTAINER COMPANY*),",2018-10-01,2018,10,11.000059,32032.06,HOT
42231,42231,Amber,Drug & chemical,North America,USA,348759.82,24413.19,249.63,"(*CINCINNATI CONTAINER COMPANY*),",2018-01-01,2018,1,7.000001,32032.06,HOT
42232,42232,Amber,Drug & chemical,North America,USA,96745.18,14511.78,74.57,"(*CINCINNATI CONTAINER COMPANY*),",2015-04-01,2015,4,15.000003,32032.06,HOT
42233,42233,Amber,Drug & chemical,North America,USA,275613.91,27561.39,199.36,"(*CINCINNATI CONTAINER COMPANY*),",2017-08-01,2017,8,10.000000,32032.06,HOT
42234,42234,Amber,Drug & chemical,North America,USA,165896.61,9953.80,120.04,"(*CINCINNATI CONTAINER COMPANY*),",2017-06-01,2017,6,6.000002,32032.06,HOT
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
148339,148339,Other,Wine,EU,Italy,11217.34,0.00,28.81,"ZURLA ANTONIO ST. MARIA,",2021-01-01,2021,1,0.000000,476346.15,HOT
148362,148362,Other,Wine,EU,Italy,15558.97,1867.08,41.09,"ZURLA ANTONIO ST. MARIA,",2022-04-01,2022,4,12.000023,476346.15,HOT
148371,148371,Other,Wine,EU,Italy,20177.69,2623.10,51.41,"ZURLA ANTONIO ST. MARIA,",2022-03-01,2022,3,13.000001,476346.15,HOT
148372,148372,Other,Wine,EU,Italy,14482.23,1013.76,35.81,"ZURLA ANTONIO ST. MARIA,",2022-01-01,2022,1,7.000027,476346.15,HOT


Select the top 10 `HOT` and the top 10 `COLD` companies (exclude companies with 0 TO or no category).

In [27]:
hot_qualified_cp_df = qualified_cp_df.sort_values(['Sales Company'], ascending=False)

grp_qualified_cp_df = hot_qualified_cp_df.groupby('Company', sort=False)
grp_qualified_cp_df.first()

Company,Unnamed: 0,Color,Product Category,Region,Country,Sales (USD),Negotiated Discount,Quantity in TO,dt,year,month,% discount,Sales Company,Sales Score
"ANHEUSER-BUSCH INBEV,",145693,Amber,Beer,EU,United Kingdom,4386343.71,0.00,8711.69,2021-02-01,2021,2,0.000000,31231947.51,HOT
"HEINEKEN,",38510,Amber,Beer,EU,Switzerland,31384672.67,313846.73,76918.58,2018-07-01,2018,7,1.000000,25098145.92,HOT
"COCA - COLA,",27749,Flint,Spirits,Latin America,Mexico,10756.13,322.68,16.23,2018-10-01,2018,10,2.999964,20797735.59,HOT
"MOLSON-COORS,",63581,Flint,Beer,North America,USA,22933317.21,2751998.07,36268.63,2019-09-01,2019,9,12.000000,13849699.91,HOT
"CARLSBERG,",138817,Green,Beer,EU,Germany,174663.73,3493.27,424.98,2020-09-01,2020,9,1.999997,13342978.55,HOT
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
"FC - OI - FAIRFIELD,",203038,,Spirits,Latin America,Colombia,0.00,0.00,0.00,2017-12-01,2017,12,0.000000,0.00,COLD
"GROUP NOT USED HEINZ WATTIE LTD,",198096,Flint,Food,APAC,New Zealand,0.00,0.00,0.00,2018-04-01,2018,4,0.000000,0.00,COLD
"OWENS AMERICA S DE RL DE CV,",202999,Flint,Spirits,Latin America,Colombia,0.00,0.00,0.00,2020-08-01,2020,8,0.000000,0.00,COLD
"SKEPTIC DISTILLERY,",83158,Green,Wine,North America,USA,0.00,0.00,0.00,2016-07-01,2016,7,0.000000,0.00,COLD


In [44]:
hot_qualified_cp_df = qualified_cp_df.sort_values(['Sales Company'], ascending=False)

# Keep only the HOT rows and group them by company
grp_qualified_cp_df = hot_qualified_cp_df[hot_qualified_cp_df['Sales Score'] == 'HOT'].groupby(['Company'])
list_of_top = []

# Each item is a (company key, sub-DataFrame) pair
for item in grp_qualified_cp_df:
    tmp_top_hot = item[1]
    relevant_top_hot = tmp_top_hot[['Company', 'Sales Company']]
    list_of_top.append(relevant_top_hot)

# Number of companies classified as HOT
len(list_of_top)

360

In [46]:
final_hot_df = pd.DataFrame(list_of_top)
final_hot_df

Unnamed: 0,0
0,Company Sal...
1,Company Sales Company 131817 A LE ...
2,Company Sales Company...
3,Company Sales Company 60919 ...
4,Company Sal...
...,...
355,Company Sales Compan...
356,Company Sales Company 129872 WIN...
357,Company Sales Company 31927 ...
358,Company Sales Company 445...
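
The table above is hard to read because passing a list of DataFrames to `pd.DataFrame` appears to store each frame as a single cell. `pd.concat` stacks the frames instead. The sketch below, on hypothetical data, combines the pieces and then picks the largest and smallest company totals, leaving out companies with 0 TO. With the real data both selections would be limited to 10 companies; the sample has fewer.

```python
import pandas as pd

# Hypothetical per-company pieces, shaped like the entries of list_of_top
pieces = [
    pd.DataFrame({'Company': ['A', 'A'], 'Sales Company': [150.0, 150.0]}),
    pd.DataFrame({'Company': ['B'], 'Sales Company': [20.0]}),
    pd.DataFrame({'Company': ['C'], 'Sales Company': [0.0]}),
]

# concat stacks the frames row-wise into one DataFrame
combined = pd.concat(pieces, ignore_index=True)

# One total per company, indexed by company name
per_company = combined.drop_duplicates('Company').set_index('Company')['Sales Company']

top_hot = per_company.nlargest(10)                      # biggest buyers
top_cold = per_company[per_company > 0].nsmallest(10)   # smallest non-zero buyers
print(top_hot)
print(top_cold)
```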


How many companies are categorized as `HOT`? How many `COLD`?

In [29]:
hot_qualified_cp_df[hot_qualified_cp_df['Sales Score'] == 'HOT'].groupby(['Company']).count()

Company,Unnamed: 0,Color,Product Category,Region,Country,Sales (USD),Negotiated Discount,Quantity in TO,dt,year,month,% discount,Sales Company,Sales Score
"(*CINCINNATI CONTAINER COMPANY*),",261,179,261,261,261,261,261,249,261,261,261,261,261,261
"A LE COQ,",168,144,168,168,168,168,168,168,168,168,168,168,168,168
"A.E. CHAPMAN & SON LTD,",845,785,845,845,845,844,845,844,845,845,845,844,845,845
"ACAITEUA LTDA,",177,171,177,177,177,177,177,177,177,177,177,177,177,177
"ACCOLADE WINES AUSTRALIA LIMITED,",154,128,154,154,154,154,154,147,154,154,154,154,154,154
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
"WINE - NBD DISTRIBUTION,",9,0,9,9,9,9,9,9,9,9,9,9,9,9
"WINE EXCEL,",311,275,311,311,311,311,311,311,311,311,311,311,311,311
"WINE GROUP (THE),",320,218,320,320,320,320,320,320,320,320,320,320,320,320
"ZUCKERMAN-HONICKMAN,",312,239,312,312,312,312,312,257,312,312,312,312,312,312

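The table above counts rows per `HOT` company rather than companies per category. To answer the question directly, one option is to count the distinct companies under each score, sketched here on a hypothetical sample:

```python
import pandas as pd

# Hypothetical sample: one HOT company (two rows) and two COLD companies
sample = pd.DataFrame({
    'Company': ['A', 'A', 'B', 'C'],
    'Sales Score': ['HOT', 'HOT', 'COLD', 'COLD'],
})

# nunique counts each company once, however many sales rows it has
companies_per_score = sample.groupby('Sales Score')['Company'].nunique()
print(companies_per_score)
```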

### What are the colors most sold in the top 10 companies?

Define the top 10 companies in order of `Quantity in TO`.

In [30]:
# Code here

Filter out `NOT ASSIGNED` to get the name of the top 10 companies

In [31]:
# Code here

Show top colors in top 10 companies

In [32]:
# Code here
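
One possible approach to the three steps above, sketched on a hypothetical sample. It assumes the trailing commas in `Company` have already been stripped, and it uses `TOP_N = 2` only because the sample is small; the exercise asks for 10.

```python
import pandas as pd

TOP_N = 2  # would be 10 for the real data

# Hypothetical sample; totals are B = 560, A = 120, C = 1
sample = pd.DataFrame({
    'Company': ['A', 'A', 'B', 'B', 'NOT ASSIGNED', 'C'],
    'Color': ['Amber', 'Flint', 'Amber', 'NOT ASSIGNED', 'Flint', 'Green'],
    'Quantity in TO': [100.0, 20.0, 60.0, 500.0, 900.0, 1.0],
})

# Steps 1 and 2: rank named companies by total tonnes
named = sample[sample['Company'] != 'NOT ASSIGNED']
top_companies = named.groupby('Company')['Quantity in TO'].sum().nlargest(TOP_N).index

# Step 3: tonnes per color within those companies, ignoring unassigned colors
in_top = named[named['Company'].isin(top_companies) & (named['Color'] != 'NOT ASSIGNED')]
color_totals = in_top.groupby('Color')['Quantity in TO'].sum().sort_values(ascending=False)
print(list(top_companies))
print(color_totals)
```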

### On what month can we find a peak in quantities sold?

In [33]:
# Code here
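
A sketch of one way to find the peak: add up `Quantity in TO` per calendar month across all years and take the month with the largest total. The sample values are hypothetical.

```python
import pandas as pd

# Hypothetical sample; monthly totals are 1 -> 15, 6 -> 20, 12 -> 41
sample = pd.DataFrame({
    'month': [1, 1, 12, 12, 6],
    'Quantity in TO': [10.0, 5.0, 40.0, 1.0, 20.0],
})

# Total tonnes per month, then the index label of the largest total
monthly = sample.groupby('month')['Quantity in TO'].sum()
peak_month = monthly.idxmax()
print(monthly)
print(peak_month)
```

Grouping by both `year` and `month` instead would show whether the peak falls in the same month every year.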

### What is the average price per unit for each company? 

Which one has the highest average price per unit? If it's `NOT ASSIGNED`, take the next company name.

First, let's add a column called `Price`: the total sales in USD divided by the quantity, rounded to 2 decimals.

In [34]:
qualified_cp_df['Price'] = qualified_cp_df.apply(lambda row: round(row['Sales (USD)']/row['Quantity in TO'],2) if row['Quantity in TO'] > 0 else 0, axis=1)

In [35]:
group_list_df = qualified_cp_df.groupby('Company')['Price']
sum_group_list_df = assign_cp_df.groupby('Company')['Quantity in TO'].sum()

Drop infinite and NaN values in the DataFrame. Start by counting the missing values in each column.

In [36]:
qualified_cp_df.isnull().sum()


Unnamed: 0                 0
Color                  40082
Product Category           0
Region                    10
Country                   10
Sales (USD)              375
Negotiated Discount        0
Quantity in TO          5569
Company                    0
dt                         0
year                       0
month                      0
% discount               375
Sales Company              0
Sales Score                0
Price                      1
dtype: int64

In [37]:
qualified_cp_df.isnull().mean()*100

Unnamed: 0              0.000000
Color                  26.460958
Product Category        0.000000
Region                  0.006602
Country                 0.006602
Sales (USD)             0.247564
Negotiated Discount     0.000000
Quantity in TO          3.676490
Company                 0.000000
dt                      0.000000
year                    0.000000
month                   0.000000
% discount              0.247564
Sales Company           0.000000
Sales Score             0.000000
Price                   0.000660
dtype: float64

In [38]:
clean_cp_df = qualified_cp_df.dropna().copy()
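
`dropna()` removes NaN but leaves infinite values in place. If any were present, a common pattern is to convert them to NaN first, as in this sketch on a hypothetical sample:

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one valid, one infinite and one missing price
sample = pd.DataFrame({
    'Company': ['A', 'B', 'C'],
    'Price': [450.0, np.inf, np.nan],
})

# Turn +/-inf into NaN so that dropna() removes those rows as well
cleaned = sample.replace([np.inf, -np.inf], np.nan).dropna()
print(cleaned)
```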

Which company has the greatest average price? Remember, `NOT ASSIGNED` is not an acceptable answer.

In [39]:
# group_price_df = qualified_cp_df.groupby('Company')['Price']
mean_group_price_df = qualified_cp_df.groupby('Company')['Price'].mean()

In [373]:
# mean_group_price_df
mean_group_price_df.groupby('Company').max()

Company
(*CINCINNATI CONTAINER COMPANY*),    1235.089425
12 SPIES VINEYARDS,                   267.283976
5280PKG,                              495.806406
96 WEST WINERY,                       700.030000
A LE COQ,                             393.748810
                                        ...     
WOLF'S RIDGE BREWING,                 436.822759
WYMORE WAREHOUSE,                       0.000000
ZOILO RUIZ MATEOS, S.L.               439.015161
ZUCKERMAN-HONICKMAN,                  550.993814
ZURLA ANTONIO ST. MARIA,              406.528460
Name: Price, Length: 720, dtype: float64
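
Because `mean_group_price_df` already holds one value per company, grouping it by `Company` again and taking `max()` returns the same list of means. To single out the company with the highest average, one option is to sort the means and skip `NOT ASSIGNED`, as sketched below on a hypothetical sample. The sample assumes the trailing commas in the names have been stripped.

```python
import pandas as pd

# Hypothetical sample; average prices are NOT ASSIGNED 2000, B 700, A 500
sample = pd.DataFrame({
    'Company': ['NOT ASSIGNED', 'A', 'A', 'B'],
    'Price': [2000.0, 400.0, 600.0, 700.0],
})

# Average price per company, highest first
mean_price = sample.groupby('Company')['Price'].mean().sort_values(ascending=False)

# Drop the placeholder name if present, then take the first remaining company
top_company = mean_price.drop('NOT ASSIGNED', errors='ignore').index[0]
print(mean_price)
print(top_company)
```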