## 2659-T1 Data Curation for Business Analytics Final Exam Part 2

This part of the final exam contains one question with sub-questions, worth 70% of the total points.

### Chanel No.5 or Burnt Hair?

<img src="https://upload.wikimedia.org/wikipedia/commons/8/85/CHANEL_No5_parfum.jpg" alt="Chanel No.5" width="300"/>

<img src="https://www.teslarati.com/wp-content/uploads/2022/10/Elon-Musk-sold-28700-bottles-of-Burnt-Hair-perfume.png" alt="burnthair" width="800"/>

The word perfume derives from the Latin perfumare, meaning "to smoke through". Perfumery, as the art of making perfumes, began in ancient Mesopotamia, Egypt, the Indus Valley civilization and possibly Ancient China.

The precise formulae of commercial perfumes are kept secret. Even if they were widely published, they would be dominated by such complex ingredients and odorants that they would be of little use in providing a guide to the general consumer in description of the experience of a scent. Nonetheless, connoisseurs of perfume can become extremely skillful at identifying components and origins of scents in the same manner as wine experts.

**Perfume is described in a musical metaphor as having three sets of notes, making the harmonious scent accord.** The notes unfold over time, with the immediate impression of the top note leading to the deeper middle notes, and the base notes gradually appearing as the final stage. These notes are created carefully with knowledge of the evaporation process of the perfume.

- Top notes: Also called the head notes. The scents that are perceived immediately on application of a perfume. Top notes consist of small, light molecules that evaporate quickly. They form a person's initial impression of a perfume and thus are very important in the selling of a perfume. Examples of top notes include mint, lavender and coriander.
- Middle notes: Also referred to as heart notes. The scent of a perfume that emerges just prior to the dissipation of the top note. The middle note compounds form the "heart" or main body of a perfume and act to mask the often unpleasant initial impression of base notes, which become more pleasant with time. Examples of middle notes include seawater, sandalwood and jasmine.
- Base notes: The scent of a perfume that appears close to the departure of the middle notes. The base and middle notes together are the main theme of a perfume. Base notes bring depth and solidity to a perfume. Compounds of this class of scents are typically rich and "deep" and are usually not perceived until 30 minutes after application. Examples of base notes include tobacco, amber and musk.

We are going to examine a perfume dataset manually scraped from [fragrantica](https://www.fragrantica.com/), which provides comprehensive information about perfumes, branded or not, including their compounds and popularity.

This dataset describes each fragrance's rating, number of votes, and compounds (top notes, middle notes and base notes). The data are contained in the file fragrance.csv. More details about the contents and use of this file follow.

**Fragrance Data File Structure (fragrance.csv)**
All fragrances are contained in the file fragrance.csv. Each line of this file after the header row represents one fragrance and has the following format:
`name, brand, rating, number_votes, top_notes, middle_notes, base_notes`


Answer the following questions using the provided dataset. You can write down intermediate results on the way to the final answers.

In [1]:
import pandas as pd
import numpy as np

### Question 1 (15 points)

Read the csv file as a dataframe with the column names "Name, Brand, Rating, Number_votes, Top_notes, Middle_notes, Base_notes". Because the data were scraped from a website, the file may contain errors and inconsistencies: there are missing values for ratings and number of votes, and you should fill them with the median values of all fragrances. In addition, the ratings mix dots and commas as decimal separators, which you need to correct as well.

The ratings should be on a 5-point scale with half-star increments (0.5 - 5.0), so you will need to round the ratings to the nearest 0.5 point. For example, if a fragrance is rated 3.6, you need to change the value to 3.5. Similarly, if a fragrance is rated 3.8, it should be changed to 4.0.

```
def round_to_nearest_half_int(num):
    return round(num * 2) / 2

```

Once you are done, make sure that the rating has exactly one decimal place (e.g. 3.5, 4.0) and that the number of votes is an integer.
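As a quick check of the rounding helper, the sketch below applies it to the two values from the example plus a few illustrative extras. It relies only on Python's built-in `round`, which resolves exact ties toward the even integer, so 3.25 becomes 3.0 while 3.75 becomes 4.0.

```python
def round_to_nearest_half_int(num):
    # Doubling maps half-point steps onto whole numbers, which round() can snap to
    return round(num * 2) / 2

# 3.6 and 3.8 come from the question text; 4.31 is an extra illustrative value
examples = [3.6, 3.8, 4.31]
rounded = [round_to_nearest_half_int(x) for x in examples]
print(rounded)

# Exact ties: round() goes to the even integer after doubling
ties = [round_to_nearest_half_int(3.25), round_to_nearest_half_int(3.75)]
print(ties)
```

If ties should always round up instead, a different rule (for example one based on `math.floor`) would be needed; the question does not specify this case.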

Note: if you cannot load the csv file, there might be some errors in the data that you need to address first.
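The cleaning steps described in this question can be sketched on a few made-up rows. The values below are hypothetical and not taken from fragrance.csv, and `pd.to_numeric` is used here as one possible way to convert the cleaned strings. Note that the median can only be computed once the strings have been converted to numbers, so the separator fix has to come before the fill.

```python
import pandas as pd

# Hypothetical toy rows, not taken from fragrance.csv
toy = pd.DataFrame({
    "rating": ["4,31", "3.57", None, "4,02"],
    "number_votes": ["682", "1,471", None, "858"],
})

# Normalise decimal commas to dots, then convert to numbers (missing stays NaN)
toy["rating"] = pd.to_numeric(toy["rating"].str.replace(",", ".", regex=False))

# Strip thousands separators before converting the vote counts
toy["number_votes"] = pd.to_numeric(
    toy["number_votes"].str.replace(",", "", regex=False)
)

# Fill the gaps with the column medians, then store votes as integers
toy["rating"] = toy["rating"].fillna(toy["rating"].median())
toy["number_votes"] = (
    toy["number_votes"].fillna(toy["number_votes"].median()).astype(int)
)
print(toy)
```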

In [2]:
headers = ["name", "brand", "rating", "number_votes", "top_notes", "middle_notes", "base_notes"]

perfumeDF = pd.read_csv("fragrance.csv", skiprows = 1, delimiter = ";").set_axis(headers,axis=1)
perfumeDF

|  | name | brand | rating | number_votes | top_notes | middle_notes | base_notes |
|---|---|---|---|---|---|---|---|
| 0 | Angels' Share | By Kilian | 4.31 | 682.0 | ['Cognac'] | ['Cinnamon', 'Tonka Bean', 'Oak'] | ['Praline', 'Vanilla', 'Sandalwood'] |
| 1 | My Way | Giorgio Armani | 3.5700000000000003 | 1471.0 | ['Orange Blossom', 'Bergamot'] | ['Tuberose', 'Indian Jasmine'] | ['White Musk', 'Madagascar Vanilla', 'Virginia... |
| 2 | Libre Intense | Yves Saint Laurent | 4.02 | 858.0 | ['Lavender', 'Mandarin Orange', 'Bergamot'] | ['Lavender', 'Tunisian Orange Blossom', 'Jasmi... | ['Madagascar Vanilla', 'Tonka Bean', 'Ambergri... |
| 3 | Dior Homme 2020 | Christian Dior | 3.42 | 1402.0 | ['Bergamot', 'Pink Pepper', 'elemi'] | ['Cashmere Wood', 'Atlas Cedar', 'Patchouli'] | ['Iso E Super', 'Haitian Vetiver', 'White Musk'] |
| 4 | Acqua di Giò Profondo | Giorgio Armani | 4.03 | 869.0 | ['Sea Notes', 'Aquozone', 'Bergamot', 'Green M... | ['Rosemary', 'Cypress', 'Lavender', 'Mastic or... | ['Mineral notes', 'Musk', 'Patchouli', 'Amber'] |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 4817 | Jean-Louis Scherrer Eau de Parfum | Jean-Louis Scherrer | 4 | 1 | ['Italian', 'Tangerine', 'Calabrian', 'bergamo... | ['iris', 'Jasmine', 'Bulgarian', 'Rose'] | ['Patchouli', 'Bourbon', 'Vetiver', 'Mysore', ... |
| 4818 | Iranzol Perfume Oil | Bruno Acampora | 434 | 35 | ['Sandalwood', 'Musk'] | ['Galbanum', 'Amber', 'Jasmine', 'Rose'] | ['Patchouli', 'Vanilla'] |
| 4819 | Night Scented Stock | Penhaligon's | 393 | 256 | ['Clove', 'Cinnamon'] | ['Ylang-', 'Ylang', 'Lily', 'Heliotrope', 'Vio... | ['Tonka', 'Bean', 'Benzoin', 'Musk', 'Sandalwo... |
| 4820 | Complice | Coty | 425 | 68 | ['Aldehydes', 'African', 'Orange', 'flower', '... | ['Lily-of-the-', 'Valley', 'Rose', 'Lilac', 'N... | ['oak', 'moss', 'Vetiver', 'Musk', 'Benzoin', ... |


In [3]:
# Ratings mix "," and "." as decimal separators: normalise to "." before converting
perfumeDF["rating"] = perfumeDF["rating"].str.replace(",", ".", regex=False)
perfumeDF["rating"] = perfumeDF["rating"].astype(float)

In [4]:
# Vote counts may contain "," as a thousands separator: strip it before converting
perfumeDF["number_votes"] = perfumeDF["number_votes"].str.replace(",", "", regex=False)
perfumeDF["number_votes"] = perfumeDF["number_votes"].astype(float)

In [5]:
median_ratings = perfumeDF.rating.median()
median_votes = perfumeDF.number_votes.median()

perfumeDF["rating"] = perfumeDF.rating.fillna(median_ratings)
perfumeDF["number_votes"] = perfumeDF.number_votes.fillna(median_votes)
perfumeDF["number_votes"] = perfumeDF["number_votes"].astype(int)

In [6]:
median_ratings

4.06

In [7]:
perfumeDF.rating.dtype

dtype('float64')

In [8]:
perfumeDF.number_votes.dtype

dtype('int64')

In [9]:
def round_to_nearest_half_int(num):
    # Doubling maps half-point steps onto whole numbers, which round() can snap to
    return round(num * 2) / 2

perfumeDF["rating"] = perfumeDF.rating.apply(round_to_nearest_half_int)

### Question 2 (10 points)

What are the top 10 most highly rated brands (in terms of **average rating**)?

In [10]:
perfumeDF.groupby("brand")["rating"].mean().sort_values(ascending = False).head(10)

brand
Holistick               5.000000
Bienaimé                5.000000
Flora                   5.000000
Mary McFadden           5.000000
Walter Wolf             5.000000
Ellen Betrix            5.000000
Aromática Cosméticos    5.000000
Jean d'Albret           4.875000
Galimard                4.750000
Corday                  4.666667
Name: rating, dtype: float64
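When reading a ranking like this, it can help to look at how many fragrances each average is based on, since a brand with a single rated fragrance may reach a perfect 5.0. The sketch below is one way to show the mean alongside the count on made-up data; the brand names, ratings, and the `min_fragrances` threshold are all hypothetical.

```python
import pandas as pd

# Hypothetical toy data, not taken from fragrance.csv
toy = pd.DataFrame({
    "brand": ["A", "A", "A", "B", "C", "C"],
    "rating": [4.0, 4.5, 4.0, 5.0, 3.5, 4.5],
})

# Average rating and number of fragrances per brand
summary = toy.groupby("brand")["rating"].agg(["mean", "count"])

# Hypothetical threshold: only rank brands with at least this many fragrances
min_fragrances = 2
ranked = summary[summary["count"] >= min_fragrances].sort_values(
    "mean", ascending=False
)
print(ranked)
```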

### Question 3 (10 points)

Categorize fragrances into quartiles of popularity (in terms of **number of votes**) and label the popularity levels "First Level", "Second Level", "Third Level", and "Fourth Level", respectively. Show the top 3 brands with the most fragrances at each popularity level.

In [11]:
label_names = ["First Level", "Second Level", "Third Level", "Fourth Level"]

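# Note: pd.cut with an integer splits the vote range into 4 equal-width bins,
# so the levels are not equally sized; pd.qcut would produce true quartiles.
perfumeDF["quartiles"] = pd.cut(perfumeDF.number_votes, 4, labels = label_names)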

# Group by "quartiles" first
quartile_groups = perfumeDF.groupby("quartiles")

# Define a function to find the top 3 brands within each quartile
def top_brands(group):
    return group["brand"].value_counts().nlargest(3)

# Apply the function to each quartile group
quartile_groups.apply(top_brands)

quartiles                      
First Level   Guerlain             118
              Dior                  87
              Chanel                80
Second Level  Guerlain              14
              Chanel                14
              Dior                  11
Third Level   Dior                   8
              Versace                4
              Narciso Rodriguez      3
Fourth Level  Mugler                 4
              Chanel                 3
              Dior                   3
Name: brand, dtype: int64

In [12]:
perfumeDF[(perfumeDF.brand == "Dior") & (perfumeDF.quartiles == "Second Level")].count()

name            11
brand           11
rating          11
number_votes    11
top_notes       11
middle_notes    11
base_notes      11
quartiles       11
dtype: int64
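The solution above bins with `pd.cut`, which divides the range of vote counts into equal-width intervals. For comparison, the sketch below contrasts it with `pd.qcut`, which places roughly the same number of rows in each bin. The vote counts are made up for illustration.

```python
import pandas as pd

# Hypothetical, heavily skewed vote counts
votes = pd.Series([1, 2, 3, 4, 5, 6, 7, 100])
labels = ["First Level", "Second Level", "Third Level", "Fourth Level"]

# Equal-width bins over the range 1..100: almost everything lands in the first bin
equal_width = pd.cut(votes, 4, labels=labels)

# Quantile-based bins: each level receives a quarter of the rows
quartiles = pd.qcut(votes, 4, labels=labels)

print(equal_width.value_counts(sort=False))
print(quartiles.value_counts(sort=False))
```

With skewed data such as vote counts, the two approaches can assign very different popularity levels, so the choice affects every later answer that uses the levels.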

### Question 4 (10 points)
Does the length of a fragrance's name have any relation to its rating? Let's create a **pivot table** with popularity level as rows, long_name (above the median name length) and short_name (below the median name length) as columns, and average rating as values. To do so, first create a new column "Name_length_long" indicating whether the length of the fragrance name is above or below the median name length of all fragrances. Explain your findings.

In [13]:
median_name = perfumeDF.name.str.len().median()
perfumeDF["Name_length_long"] = "median"
perfumeDF.loc[perfumeDF.name.str.len() < median_name, "Name_length_long"] = "short_name"
perfumeDF.loc[perfumeDF.name.str.len() > median_name, "Name_length_long"] = "long_name"

pd.pivot_table(perfumeDF, index = "quartiles", columns = "Name_length_long", values = "rating", aggfunc = "mean")

| quartiles / Name_length_long | long_name | median | short_name |
|---|---|---|---|
| First Level | 4.083648 | 4.023649 | 4.045926 |
| Second Level | 4.073034 | 4.041667 | 3.918803 |
| Third Level | 3.973684 | NaN | 4.017857 |
| Fourth Level | 3.916667 | NaN | 3.9 |
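The three-way labelling and the empty cells in the pivot table can be reproduced on a small made-up example. The sketch below uses `np.select` as an alternative to the two `.loc` assignments; the names, levels, and ratings are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from fragrance.csv
toy = pd.DataFrame({
    "name": ["Iris", "Amber", "Cedar Noir", "Oud", "Vetiver Gold"],
    "level": ["First Level", "First Level", "Second Level", "Second Level", "First Level"],
    "rating": [3.5, 4.5, 4.0, 3.0, 5.0],
})

lengths = toy["name"].str.len()
median_len = lengths.median()

# Names exactly at the median length fall into neither group and keep "median"
toy["Name_length_long"] = np.select(
    [lengths < median_len, lengths > median_len],
    ["short_name", "long_name"],
    default="median",
)

table = pd.pivot_table(
    toy, index="level", columns="Name_length_long", values="rating", aggfunc="mean"
)
print(table)
```

A level with no name of exactly median length has no rows for the "median" column, which is why such cells come out as missing.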


### Question 5 (10 points)

For Chanel perfumes, what is the most popular note choice for **top note, middle note and base note, respectively** (e.g. the most popular one out of Rose, Jasmine, Orange, etc. for top note, and likewise for middle note and base note)? Note that the top/middle/base notes may each contain several notes, so you need to split them first. How many Chanel perfumes contain all three most popular notes?

In [14]:
perfumeDF

|  | name | brand | rating | number_votes | top_notes | middle_notes | base_notes | quartiles | Name_length_long |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Angels' Share | By Kilian | 4.5 | 682 | ['Cognac'] | ['Cinnamon', 'Tonka Bean', 'Oak'] | ['Praline', 'Vanilla', 'Sandalwood'] | First Level | median |
| 1 | My Way | Giorgio Armani | 3.5 | 1471 | ['Orange Blossom', 'Bergamot'] | ['Tuberose', 'Indian Jasmine'] | ['White Musk', 'Madagascar Vanilla', 'Virginia... | First Level | short_name |
| 2 | Libre Intense | Yves Saint Laurent | 4.0 | 858 | ['Lavender', 'Mandarin Orange', 'Bergamot'] | ['Lavender', 'Tunisian Orange Blossom', 'Jasmi... | ['Madagascar Vanilla', 'Tonka Bean', 'Ambergri... | First Level | median |
| 3 | Dior Homme 2020 | Christian Dior | 3.5 | 1402 | ['Bergamot', 'Pink Pepper', 'elemi'] | ['Cashmere Wood', 'Atlas Cedar', 'Patchouli'] | ['Iso E Super', 'Haitian Vetiver', 'White Musk'] | First Level | long_name |
| 4 | Acqua di Giò Profondo | Giorgio Armani | 4.0 | 869 | ['Sea Notes', 'Aquozone', 'Bergamot', 'Green M... | ['Rosemary', 'Cypress', 'Lavender', 'Mastic or... | ['Mineral notes', 'Musk', 'Patchouli', 'Amber'] | First Level | long_name |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4817 | Jean-Louis Scherrer Eau de Parfum | Jean-Louis Scherrer | 4.0 | 1 | ['Italian', 'Tangerine', 'Calabrian', 'bergamo... | ['iris', 'Jasmine', 'Bulgarian', 'Rose'] | ['Patchouli', 'Bourbon', 'Vetiver', 'Mysore', ... | First Level | long_name |
| 4818 | Iranzol Perfume Oil | Bruno Acampora | 4.5 | 35 | ['Sandalwood', 'Musk'] | ['Galbanum', 'Amber', 'Jasmine', 'Rose'] | ['Patchouli', 'Vanilla'] | First Level | long_name |
| 4819 | Night Scented Stock | Penhaligon's | 4.0 | 256 | ['Clove', 'Cinnamon'] | ['Ylang-', 'Ylang', 'Lily', 'Heliotrope', 'Vio... | ['Tonka', 'Bean', 'Benzoin', 'Musk', 'Sandalwo... | First Level | long_name |
| 4820 | Complice | Coty | 4.0 | 68 | ['Aldehydes', 'African', 'Orange', 'flower', '... | ['Lily-of-the-', 'Valley', 'Rose', 'Lilac', 'N... | ['oak', 'moss', 'Vetiver', 'Musk', 'Benzoin', ... | First Level | short_name |


In [15]:
def cleaning(str_):
    # Remove the brackets and quote characters of the list-like string
    for char in "[]'\"":
        str_ = str_.replace(char, "")

    # Split on commas and trim surrounding whitespace from each note
    return [element.strip() for element in str_.split(",")]

perfumeDF["top_notes"] = perfumeDF.top_notes.apply(cleaning)
perfumeDF["middle_notes"] = perfumeDF.middle_notes.apply(cleaning)
perfumeDF["base_notes"] = perfumeDF.base_notes.apply(cleaning)
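An alternative to stripping characters by hand is to parse each cell as a Python literal. The sketch below assumes every cell is a valid Python list literal such as `"['Cinnamon', 'Tonka Bean', 'Oak']"`, which has not been verified against the full file; `parse_notes` is a hypothetical helper name.

```python
import ast

def parse_notes(cell):
    """Parse a list-like string into a list of stripped note names."""
    return [note.strip() for note in ast.literal_eval(cell)]

# Example cell in the format shown in the dataframe output
raw = "['Cinnamon', 'Tonka Bean', 'Oak']"
parsed = parse_notes(raw)
print(parsed)
```

One difference in behaviour: `cleaning` deletes every apostrophe, so a note whose name contains one would be altered, whereas `ast.literal_eval` keeps the note text as written. On the other hand, `ast.literal_eval` raises an error on malformed cells, so the manual approach may be more forgiving with messy data.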

In [16]:
ChanelDF = perfumeDF[perfumeDF.brand == "Chanel"]

# explode() gives one note per row; value_counts().idxmax() picks the most frequent one
MVP_top_note = ChanelDF.top_notes.explode().value_counts().idxmax()
MVP_middle_note = ChanelDF.middle_notes.explode().value_counts().idxmax()
MVP_base_note = ChanelDF.base_notes.explode().value_counts().idxmax()

print("MVP Top Note:", MVP_top_note)
print("MVP Middle Note:", MVP_middle_note)
print("MVP Base Note:", MVP_base_note)

MVP Top Note: Bergamot
MVP Middle Note: Jasmine
MVP Base Note: Musk
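The last part of the question, counting the perfumes that contain all three most popular notes, could be approached as in the sketch below. It runs on made-up rows whose note columns are already lists, so the resulting count says nothing about the real Chanel data; `has_note` is a hypothetical helper.

```python
import pandas as pd

# Hypothetical toy rows with note columns already split into lists
toy = pd.DataFrame({
    "name": ["P1", "P2", "P3"],
    "top_notes": [["Bergamot", "Lemon"], ["Bergamot"], ["Neroli"]],
    "middle_notes": [["Jasmine", "Rose"], ["Rose"], ["Jasmine"]],
    "base_notes": [["Musk"], ["Musk", "Vanilla"], ["Musk"]],
})

def has_note(column, note):
    # True for rows whose list of notes contains the given note exactly
    return column.apply(lambda notes: note in notes)

# A perfume qualifies only if all three conditions hold at once
mask = (
    has_note(toy["top_notes"], "Bergamot")
    & has_note(toy["middle_notes"], "Jasmine")
    & has_note(toy["base_notes"], "Musk")
)
n_all_three = int(mask.sum())
print(n_all_three)
```

Applied to the Chanel subset with the three notes found above, the same mask would give the requested count.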


### Question 6 (15 points)

Chanel No.5 Eau de Parfum has a very classic note combination: for example, Bergamot as a top note, Rose as a middle note, and Vanilla as a base note. If you like it, you may want to explore which other perfumes you might also like. Design personalized recommendations for Elon Musk by choosing perfumes similar to Chanel No.5 as follows.

1. Identify the perfumes with the same note combination (i.e., **Bergamot** as a top note, **Rose** as a middle note, and **Vanilla** as a base note).
2. From these perfumes, keep the most popular ones (i.e., those in the **Fourth Level** of popularity).
3. Now add some "Musk" flavor by choosing perfumes that also have **Musk** in their base notes.
4. Show the top 3 perfumes and their brands with the highest **Rating**.






In [17]:
# Bergamot for top note, Rose for Middle note, and Vanilla for Base note
has_bergamot = perfumeDF["top_notes"].str.contains("Bergamot", regex=False)
has_rose = perfumeDF["middle_notes"].str.contains("Rose", regex=False)
has_vanilla = perfumeDF["base_notes"].str.contains("Vanilla", regex=False)
selectionDF = perfumeDF[has_bergamot & has_rose & has_vanilla]

# most popular
selectionDF_2 = selectionDF[selectionDF.quartiles == "Fourth Level"]

# Musk in base note
selectionDF_3 = selectionDF_2[(selectionDF_2["base_notes"].str.contains("Musk", regex=False))]
#print(selectionDF_3[["name", "brand", "rating", "top_notes", "middle_notes", "base_notes"]])

# Show top 3 perfume and brands with the highest Rating
selectionDF_3.groupby(["brand", "name"])["rating"].mean().nlargest(3)


brand   name             
Chanel  Coco Mademoiselle    4.0
Dior    J'adore              4.0
Mugler  Angel                3.5
Name: rating, dtype: float64

Congratulations! I trust your recommendations to Elon Musk are surely better than his Burnt Hair :)