# Web Scraping with [Python](https://www.python.org/) using [`BeautifulSoup`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) and [`requests`](https://2.python-requests.org/en/master/)

The task is to scrape user reviews of a Beiersdorf product and analyze their content by generating a word cloud.

__Set up__

In [None]:
%load_ext autoreload
%autoreload 2

__Import of Python modules__

In [None]:
from bs4 import BeautifulSoup
import requests

**Add path to look for modules**

In [None]:
import sys
sys.path.append("../src")

In [None]:
import helper_functions as hf

## Understanding the website

Before scraping, check the site's [https://www.makeupalley.com/robots.txt](https://www.makeupalley.com/robots.txt) file to see which paths automated clients are allowed to access.
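
The check can also be done programmatically with the standard library's `urllib.robotparser`. The sketch below parses a small, made-up rule set so that it runs offline; `SAMPLE_ROBOTS_TXT` and `account_url` are assumptions for illustration and do not reflect the real file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; the real file at
# https://www.makeupalley.com/robots.txt may say something different.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /account/
Allow: /product/
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

review_url = (
    "https://www.makeupalley.com/product/showreview.asp/"
    "ItemId=5962/Creme/NIVEA/Lotions-Creams?page=1"
)
account_url = "https://www.makeupalley.com/account/settings"  # hypothetical path

# can_fetch() answers whether the given user agent may request the URL.
review_allowed = parser.can_fetch("*", review_url)
account_allowed = parser.can_fetch("*", account_url)
print(review_allowed, account_allowed)
```

To check the live file instead, call `parser.set_url("https://www.makeupalley.com/robots.txt")` followed by `parser.read()` in place of `parser.parse(...)`.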

In [None]:
url = 'https://www.makeupalley.com/product/showreview.asp/ItemId=5962/Creme/NIVEA/Lotions-Creams?page=1#reviews'

## Fetching the content of the website using `requests` and `BeautifulSoup`

In [None]:
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
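
The request above assumes the server answers promptly and with a success status. A slightly more defensive variant could look like the sketch below; `fetch_soup` is a hypothetical helper name and the 10-second timeout is an arbitrary choice, neither of which is part of `helper_functions`.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download `url` and return the parsed HTML (hypothetical helper)."""
    # A timeout keeps the cell from hanging if the server does not answer.
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of
    # silently parsing an error page.
    response.raise_for_status()
    return BeautifulSoup(response.content, "html.parser")
```

With this helper, the two lines above reduce to `soup = fetch_soup(url)`.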

In [None]:
soup

### Find the relevant element that holds the data

In [None]:
soup.find('div', {"class": '__ReviewTextReadMoreV2__'})

In [None]:
soup.find('div', {"class": '__ReviewTextReadMoreV2__'})['data-text']
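
Note that `find` returns `None` when nothing matches, in which case the subscript above raises a `TypeError`. The sketch below shows a guarded lookup on a small, made-up HTML snippet; `sample_html` is an assumption for illustration and the live page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration; the live page may differ.
sample_html = """
<div class="__ReviewTextReadMoreV2__" data-text="Rich and creamy."></div>
<div class="other">No review here.</div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

first_review = sample_soup.find("div", class_="__ReviewTextReadMoreV2__")
# find() returns None when nothing matches, so check before reading attributes.
first_text = first_review.get("data-text") if first_review is not None else None
print(first_text)

# A class that does not occur in the document yields None.
missing = sample_soup.find("div", class_="does-not-exist")
print(missing)
```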

### Find all review items

In [None]:
items = soup.find_all('div', {"class": '__ReviewTextReadMoreV2__'})
print(len(items))
items

### Extract the text data only

In [None]:
" ".join([x["data-text"] for x in items])

### Refactor the logic into a function
* Loop over all matching elements
* Read only the `data-text` attribute of each one
* Join the extracted text into a single string

In [None]:
def extract_text_from_soup(soup):
    # your code here
    pass
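
One possible way to fill in the stub is sketched below. It is only an illustration based on the steps listed above; the actual `hf.extract_text_from_soup` in `../src` may be implemented differently. The function name `extract_text_from_soup_sketch`, the `css_class` parameter, and `sample_html` are assumptions.

```python
from bs4 import BeautifulSoup


def extract_text_from_soup_sketch(soup, css_class="__ReviewTextReadMoreV2__"):
    """Join the `data-text` attribute of every matching <div> into one string."""
    items = soup.find_all("div", class_=css_class)
    # Skip elements without the attribute instead of raising a KeyError.
    return " ".join(
        item["data-text"] for item in items if item.has_attr("data-text")
    )


# Hypothetical markup for illustration; the live page may differ.
sample_html = """
<div class="__ReviewTextReadMoreV2__" data-text="Great cream."></div>
<div class="__ReviewTextReadMoreV2__" data-text="Too greasy for me."></div>
<div class="__ReviewTextReadMoreV2__"></div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")
sample_text = extract_text_from_soup_sketch(sample_soup)
print(sample_text)
```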

In [None]:
??hf.extract_text_from_soup

In [None]:
review_text = hf.extract_text_from_soup(soup)
review_text

In [None]:
print(f'There are {len(review_text.split())} words extracted.')

## Generating a word cloud

In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# Create and generate a word cloud image:
wordcloud = WordCloud().generate(review_text)


# Display the generated image:
fig, ax = plt.subplots(figsize=(12,6))
ax.imshow(wordcloud, interpolation='bilinear')
ax.axis("off");
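
A word cloud sizes each word by how often it occurs, so it can help to look at the counts behind the image. The sketch below uses only the standard library; `top_words`, `STOP_WORDS`, and `sample_text` are hypothetical, and the short stop-word list is not the one `WordCloud` applies internally.

```python
import re
from collections import Counter

# Small hypothetical stop-word list; WordCloud ships its own, larger one.
STOP_WORDS = {"the", "a", "and", "is", "it", "i", "my", "for", "this"}


def top_words(text, n=3):
    """Return the `n` most common words in `text`, ignoring case and stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(word for word in words if word not in STOP_WORDS)
    return counts.most_common(n)


sample_text = "This cream is rich. I love this cream and my skin loves the rich cream."
most_common = top_words(sample_text, n=2)
print(most_common)
```

Applied to the scraped reviews, the call would be `top_words(review_text, n=10)`.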

***