# Fragrance analysis
Author: [@katychuang](http://katychuang.com)

Description: I want to learn more about what's out there in the fragrance world, so I am starting this project to collect data. There is no existing API for a database of fragrance information, so I'm scraping websites to collect data for analysis.

## Scrapy example

The code in this notebook is an example of using Scrapy to scrape data from a single Fragrantica page. It was written for Python 3 and Scrapy 1.4.0.

---

### Making requests

The `requests` library fetches the page, and its body is then wrapped in Scrapy's [TextResponse](https://doc.scrapy.org/en/latest/topics/request-response.html#scrapy.http.TextResponse) class so that the HTML can be queried with XPath selectors.

Thanks to [@jasonwirth](https://github.com/jasonwirth)'s tip about using user agent strings, I was able to get around the 403 Forbidden errors while scraping. There are [many user agents](http://www.useragentstring.com/pages/useragentstring.php?name=Firefox) available to use, the [top ones are listed here](https://techblog.willshouse.com/2012/01/03/most-common-user-agents/), and it is good practice to rotate or randomize them between requests.
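As a minimal sketch of that rotation idea, the snippet below picks a user agent at random for each request. The `USER_AGENTS` pool and the `random_headers()` helper are hypothetical names introduced here for illustration, and the strings are only examples; they are not part of the original notebook.

```python
import random

# Illustrative pool of user agent strings (assumed examples; replace with current ones)
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) "
    "Ubuntu Chromium/58.0.3029.110 Chrome/58.0.3029.110 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/58.0.3029.110 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:53.0) Gecko/20100101 Firefox/53.0",
]

def random_headers(pool=USER_AGENTS):
    """Return a headers dict with a user agent picked at random from the pool."""
    return {"User-Agent": random.choice(pool)}

headers = random_headers()
```

The resulting dictionary can be passed as the `headers` argument of `requests.get()`, so each call presents a different browser signature.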


In [1]:
import requests
from scrapy.http import TextResponse

url = "https://www.fragrantica.com/designers/Dolce%26Gabbana.html"
user_agent = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/58.0.3029.110 Chrome/58.0.3029.110 Safari/537.36'}

r = requests.get(url, headers=user_agent)
response = TextResponse(r.url, body=r.text, encoding='utf-8')

Once we have the response, which is a huge chunk of minified HTML, we need to navigate the DOM structure to get exactly the information needed. Thankfully, the perfumes are listed under the element with the ID `col1`, so I can use it as the root and get all the perfume names by selecting specific child nodes.

In [2]:
# Navigate the perfume list
c = response.xpath('//div[@id="col1"]/div[@class="perfumeslist"]/div/div/p/a//text()').extract()
print("There are {} perfumes from Dolce & Gabbana".format(len(c)))
print(c)

There are 66 perfumes from Dolce & Gabbana
[' Sicily', ' By', ' By', ' D&G', ' D&G', ' Dolce&Gabbana Perfume for Babies', ' Dolce&Gabbana Pour Femme', ' Dolce&Gabbana Pour Femme Intense', ' Dolce&Gabbana Pour Homme', ' Dolce&Gabbana Pour Homme Intenso ', ' D&G Feminine', ' D&G Masculine', " D&G Anthology L'Empereur 4", ' D&G Anthology La Force 11', ' D&G Anthology La Lune 18', ' D&G Anthology La Roue de La Fortune 10', ' D&G Anthology La Temperance 14', ' D&G Anthology Le Bateleur 1', ' D&G Anthology Le Fou 21', ' D&G Anthology L`Amoureux 6', ' D&G Anthology L`Imperatrice 3', ' Dolce', ' Dolce Floral Drops', ' Dolce Rosa Excelsa', ' D&G Light Blue', ' Light Blue Discover Vulcano Pour Homme', ' Light Blue Dreaming in Portofino', ' Light Blue Eau Intense', ' Light Blue Eau Intense Pour Homme', ' Light Blue Escape to Panarea', ' Light Blue Living Stromboli', ' Light Blue Love in Capri', ' Light Blue pour Homme', ' Light Blue Pour Homme Beauty of Capri ', ' Light Blue Sunset in Salina', ' 

The `extract()` method returns a list, so it was easy to get the number of items by finding the length of the list.
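To try the same navigation idea offline, here is a small sketch that swaps Scrapy for the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath. The `SAMPLE` markup is a hand-written stand-in that assumes the nesting implied by the XPath above; it is not the real page.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for the page structure (assumed, much smaller than the real page)
SAMPLE = """
<html><body>
<div id="col1">
  <div class="perfumeslist">
    <div><div><p><a href="/perfume/a.html"><img src="a.jpg"/> Sicily</a></p></div></div>
  </div>
  <div class="perfumeslist">
    <div><div><p><a href="/perfume/b.html"><img src="b.jpg"/> By</a></p></div></div>
  </div>
</div>
</body></html>
""".strip()

root = ET.fromstring(SAMPLE)

# Same path as the Scrapy query, minus the trailing //text() step,
# which ElementTree's XPath subset does not support
links = root.findall(".//div[@id='col1']/div[@class='perfumeslist']/div/div/p/a")

# itertext() gathers the text inside each link, playing the role of //text()
names = ["".join(a.itertext()) for a in links]
print("There are {} perfumes in the sample".format(len(names)))
```

As in the real output, the names keep their leading space because the text follows the `<img>` tag inside the link.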

---

### Parsing the response
Once you have the response, you can parse it to pull out the pieces of information needed. I'm interested in the name of the perfume, its release year, the gender it's made for, the image, and the URL of the product detail page. The function `parse_perfume_data()` takes the response and returns these fields as a list of dictionaries.

In [3]:
def parse_perfume_data(response):
    """Extract one dictionary per perfume from the designer page."""
    my_list = []
    for row in response.xpath('//div[@id="col1"]/div[@class="perfumeslist"]'):
        perfume = {}
        perfume['name'] = row.xpath('div/div/p/a//text()').extract()[0]
        # The year is missing for some perfumes, so year() falls back to ''
        perfume['year'] = year(row.xpath('div/div/p/span[@class="mtext"]/span/strong/text()').extract())
        # Take the second class token and drop its first 6 characters to leave the gender
        perfume['gender'] = row.xpath('div/@class').extract()[0].split(' ')[1][6:]
        perfume['img'] = row.xpath('div/div/p/a/img//@src').extract()[0]
        perfume['url'] = row.xpath('div/div/p/a/@href').extract()[0]
        my_list.append(perfume)
    return my_list

def year(y):
    """Return the first extracted year, or '' when none was found."""
    if len(y) >= 1:
        return y[0]
    else:
        return ''

Now we call this function, passing in the `response`, to get `data`, a structured extraction of the page. I chose some easy-to-remember field names to use as dictionary keys.

In [4]:
data = parse_perfume_data(response)

Here's what the data looks like after parsing.

In [5]:
print(data[0])
print(data[1])

{'name': ' Sicily', 'year': '', 'gender': 'female', 'img': 'https://fimgs.net/images/perfume/m.486.jpg', 'url': '/perfume/Dolce-Gabbana/Sicily-486.html'}
{'name': ' By', 'year': '1999', 'gender': 'female', 'img': 'https://fimgs.net/images/perfume/m.489.jpg', 'url': '/perfume/Dolce-Gabbana/By-489.html'}
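The parsed records still carry a leading space in the name, an empty string for a missing year, and a relative URL. The sketch below shows one possible way to tidy a record; `clean_record()` and `BASE_URL` are hypothetical names added for illustration, and the base URL is assumed from the request URL used earlier.

```python
from urllib.parse import urljoin

# Site root, assumed from the URL requested earlier in the notebook
BASE_URL = "https://www.fragrantica.com"

def clean_record(record, base_url=BASE_URL):
    """Return a tidied copy of one parsed perfume dictionary."""
    cleaned = dict(record)
    # Drop the whitespace surrounding the scraped name
    cleaned["name"] = record["name"].strip()
    # Convert the year to an int, or None when the page listed no year
    cleaned["year"] = int(record["year"]) if record["year"].isdigit() else None
    # Turn the relative product link into an absolute URL
    cleaned["url"] = urljoin(base_url, record["url"])
    return cleaned

sample = {'name': ' Sicily', 'year': '', 'gender': 'female',
          'img': 'https://fimgs.net/images/perfume/m.486.jpg',
          'url': '/perfume/Dolce-Gabbana/Sicily-486.html'}
cleaned = clean_record(sample)
print(cleaned)
```

Applying it to the whole list would be `[clean_record(p) for p in data]`, leaving the original `data` untouched.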


---

### Data Analysis

Now we can do some quick stats, for example counting how many products there are per gender:

In [6]:
from collections import Counter
Counter(token['gender'] for token in data)

Counter({'female': 33, 'male': 24, 'unisex': 9})

In [7]:
Counter(token['year'] for token in data)

Counter({'': 1,
         '1992': 1,
         '1994': 1,
         '1997': 1,
         '1999': 3,
         '2001': 1,
         '2006': 1,
         '2007': 1,
         '2008': 2,
         '2009': 7,
         '2010': 1,
         '2011': 9,
         '2012': 6,
         '2013': 7,
         '2014': 8,
         '2015': 8,
         '2016': 5,
         '2017': 3})
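For a quick look at the busiest years, `Counter.most_common()` can rank the counts. This is only a sketch: the `year_counts` values are copied by hand from the output above rather than recomputed from `data`, and the entry with no year is dropped first.

```python
from collections import Counter

# Year counts copied from the output above
year_counts = Counter({'': 1, '1992': 1, '1994': 1, '1997': 1, '1999': 3,
                       '2001': 1, '2006': 1, '2007': 1, '2008': 2, '2009': 7,
                       '2010': 1, '2011': 9, '2012': 6, '2013': 7, '2014': 8,
                       '2015': 8, '2016': 5, '2017': 3})

# Ignore the perfume whose year is missing
released = Counter({y: n for y, n in year_counts.items() if y})

# The three years with the most releases
top_three = released.most_common(3)
print(top_three)
```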

In [8]:
# Collect the distinct years, skipping perfumes with no year listed
years = sorted({p['year'] for p in data if p['year']})
yTotal = Counter(token['year'] for token in data)

print('Year', 'M', 'F', 'U', 'T')
for y in years:
    filtered = [d for d in data if d['year'] == y]
    count = Counter(token['gender'] for token in filtered)
    print(y, count['male'], count['female'], count['unisex'], yTotal[y])

Year M F U T
1992 0 1 0 1
1994 1 0 0 1
1997 1 0 0 1
1999 1 2 0 3
2001 0 1 0 1
2006 0 1 0 1
2007 1 0 0 1
2008 1 1 0 2
2009 3 4 0 7
2010 1 0 0 1
2011 1 4 4 9
2012 4 2 0 6
2013 1 3 3 7
2014 4 4 0 8
2015 3 4 1 8
2016 1 4 0 5
2017 1 1 1 3


*Where:*

* M = Male
* F = Female
* U = Unisex
* T = Total
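The same year-by-gender table can also be built in a single pass by counting `(year, gender)` pairs. This is a sketch rather than the notebook's method: `year_gender_table()` is a hypothetical helper, the sample records are made up to mirror the 1997 and 1999 rows, and the total assumes only the three gender labels seen above.

```python
from collections import Counter

def year_gender_table(records):
    """Return rows of (year, male, female, unisex, total), sorted by year."""
    # Count each (year, gender) pair once, skipping records with no year
    pairs = Counter((r["year"], r["gender"]) for r in records if r["year"])
    years = sorted({year for year, _ in pairs})
    rows = []
    for y in years:
        m, f, u = (pairs[(y, g)] for g in ("male", "female", "unisex"))
        rows.append((y, m, f, u, m + f + u))
    return rows

# Made-up sample mirroring the 1997 and 1999 rows of the table
sample = [
    {"year": "1999", "gender": "female"},
    {"year": "1999", "gender": "female"},
    {"year": "1999", "gender": "male"},
    {"year": "1997", "gender": "male"},
    {"year": "", "gender": "female"},
]
rows = year_gender_table(sample)
for row in rows:
    print(*row)
```

Counting pairs avoids filtering the full list once per year, which matters little for 66 perfumes but scales better across many designers.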


---

# Appendix

In [9]:
data

[{'gender': 'female',
  'img': 'https://fimgs.net/images/perfume/m.486.jpg',
  'name': ' Sicily',
  'url': '/perfume/Dolce-Gabbana/Sicily-486.html',
  'year': ''},
 {'gender': 'female',
  'img': 'https://fimgs.net/images/perfume/m.489.jpg',
  'name': ' By',
  'url': '/perfume/Dolce-Gabbana/By-489.html',
  'year': '1999'},
 {'gender': 'male',
  'img': 'https://fimgs.net/images/perfume/m.490.jpg',
  'name': ' By',
  'url': '/perfume/Dolce-Gabbana/By-490.html',
  'year': '1997'},
 {'gender': 'male',
  'img': 'https://fimgs.net/images/perfume/m.483.jpg',
  'name': ' D&G',
  'url': '/perfume/Dolce-Gabbana/D-G-483.html',
  'year': '1994'},
 {'gender': 'female',
  'img': 'https://fimgs.net/images/perfume/m.484.jpg',
  'name': ' D&G',
  'url': '/perfume/Dolce-Gabbana/D-G-484.html',
  'year': '1992'},
 {'gender': 'unisex',
  'img': 'https://fimgs.net/images/perfume/m.23597.jpg',
  'name': ' Dolce&Gabbana Perfume for Babies',
  'url': '/perfume/Dolce-Gabbana/Dolce-Gabbana-Perfume-for-Babies-2359

In [16]:
# Display the D&G perfume bottles inline
from IPython.display import HTML, display

def make_html(image):
    """Build an inline <img> tag for one image URL."""
    return '<img src="{}" style="display:inline;margin:1px"/>'.format(image)

item = ''.join(make_html(x['img']) for x in data)
display(HTML(item))



---
This notebook was created by [Dr. Kat](http://github.com/katychuang) for [macbookandheels.com](http://macbookandheels.com). Please give credit if any of the contents are re-used.

For any questions, comments, or suggestions, please get in touch via [twitter @katychuang](http://twitter.com/katychuang) or [github @katychuang](http://github.com/katychuang).