# Web Scraping 

---

### Singulier


---


Install the required packages if they are not already installed.

In [None]:
!pip3 install requests lxml pandas wordcloud seaborn matplotlib openpyxl

# Import the needed packages

#### Import package to send an HTTP request.
Sending an HTTP request is exactly what your browser does when you type the url of a website and press 'enter'.<br>
FYI, when you type a url in your browser, it sends a request to a server, which returns the HTML to display (simplified). 
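
To make the idea of a request more concrete, here is a minimal sketch that builds the request a browser would send, without actually sending it. The url and the `User-Agent` value are placeholders chosen for illustration only.

```python
import requests

# Describe a GET request; nothing is sent over the network here.
request = requests.Request(
    method="GET",
    url="https://www.example.com/",
    headers={"User-Agent": "my-training-scraper/0.1"},  # hypothetical value
)
prepared = request.prepare()

print(prepared.method)                 # the HTTP verb
print(prepared.url)                    # the address being asked for
print(prepared.headers["User-Agent"])  # one of the headers sent along
```

A request is therefore little more than a verb, an address and a few headers; the server answers with a response whose body is the HTML.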

<img src="https://img.webnots.com/2013/06/HTTP-Request-and-Response-Over-Web-1.png" width="50%">

In [None]:
import requests

#### Import a package to handle the HTML
HTML is the language in which websites are written. Your browser receives the HTML and displays it. Here we need a package that lets us read the HTML easily.

<img src="http://coderstutorials.com/wp-content/uploads/2020/06/my-photo.png" width="50%">

In [None]:
from lxml import html

In [None]:
# Import package to print lists/dictionaries nicely
from pprint import pprint

# Introduction to Scraping

### First, let's see some basics of HTML:

**HTML is composed of tags.** The main ones we are going to see are:

|Tag name|Description|
|:-:|:-|
|div|Defines a section in a document|
|span|Defines an inline section of text within a document|
|p|Defines a paragraph|
|h1, h2... h6|Titles from higher to lower importance|
|a|Defines a hyperlink|



A comprehensive list can be found here: https://www.w3schools.com/tags/default.asp.

Tags are usually nested, e.g:

```html
<!DOCTYPE html>
<html>
    <head>
         <title>This is a title</title>
    </head>
    <body>
        <div>
            <p>Hello world!</p>
        </div>
    </body>
</html>
```

**Attributes are used to provide additional information about HTML elements.**

```html
<a href="https://www.google.com">My first link</a>
```

The `a` tag defines a hyperlink, meaning (to keep it simple) you can click on it. The `href` attribute defines the link's destination. The text between `<a>` and `</a>` is what the user sees. Here is how it renders: <a href="https://www.google.com">My first link</a>

In a more general way, you will find things such as:
```html
<tag attribute_key=attribute_value>...</tag>
```

`href` is an attribute that is only valid on a few tags, `a` being the most common one. However, some attributes can be used across all tags, such as a very useful and common one: `class`. As its name suggests, the class attribute is used to specify a class for an HTML element. Multiple tags can share the same class. The main purpose of this attribute is to identify multiple objects that share the same features. For instance, all titles may have the same class name, and all `div` tags that contain the same kind of information will often have the same class name.
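
As a small illustration, the sketch below parses a single tag with lxml and reads its name, attributes and text. The `external` class is made up for this example.

```python
from lxml import html

# Parse a single tag; the "external" class is invented for the example.
link = html.fromstring(
    '<a href="https://www.google.com" class="external">My first link</a>'
)

print(link.tag)           # name of the tag
print(link.get("href"))   # value of the href attribute
print(link.get("class"))  # value of the class attribute
print(link.text)          # text between <a> and </a>
```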

### Let's see a generic, simple and common example:

```html
<div class="result">
    <h1 class="title">My Result 1</h1>
    <p class="info">My Info 1</p>
</div>
<div class="result">
    <h1 class="title">My Result 2</h1>
    <p class="info">My Info 2</p>
</div>
```

Here, each result is within a `div` element with class `result`. Each result has a title, in an `h1` element with class `title`, and some information, in a `p` element with class `info`.

### Here comes the scraping

If you want to scrape this information, you'll have to:

- Get all the results of the page, which means all the `div` elements with class `result`, and for each one:
    - Retrieve the title, in other words, what's in the element `h1` with class `title`
    - and the information that is contained in the `p` element with class `info`


### How to retrieve the information you want

The lxml package we imported will find the information we ask for, as long as we give enough details. To give these details, we are going to use XPath.

Without going into too much detail, XPath is a way of describing a path that helps you navigate through the HTML. For this purpose, the HTML is considered as a tree, where each tag is a "node". Below, tags and nodes refer to the same thing.

*The tables below are from https://www.w3schools.com/xml/xpath_syntax.asp*

```html
<body>
    <div>
        <div class="result">
            <h1 class="title"><a href="link/to/my/resultpage1">My Result 1</a></h1>
            <p class="info">My Info 1</p>
        </div>
    </div>
</body>
```

To help you visualize how XPath works, consider this HTML as your computer directory, where each tag is a folder:
- The main folder is `body`, 
    - inside it you have a `div` folder, 
        - which contains a `div` folder 
            - which contains an `h1` and a `p` folder. 
                - The `h1` folder contains an `a` folder 
                     - that contains some text, 
                - and the `p` folder contains some text.

As on your computer, a path can be written as: `/body/div/div/h1/a`. XPath follows this structure, while adding a few shortcuts to ease the identification of the information you are looking for.

#### Selecting Nodes

|Expression|Description|Example|
|:-:       |:-|:-|
|/         |Selects from the root node when placed at the start |/h1|
|//        |Selects nodes in the document from the current node that match the selection no matter where they are|//p|
|.         |Selects the current node|.|
|..        |Selects the parent of the current node|..|
|nodename  |Selects all nodes with the name "nodename"|div|
|@         |Selects attributes|@href|


#### Path

|Expression Path|Description|
|:-:       |:-|
|/div        |Selects the root element div |
|//div        |Selects all div tags no matter where they are in the document|
|//div/h1        |Selects all h1 tags that are children of a div tag no matter where they are in the document|
|//div[@class='result']        |Selects all div tags with class "result" no matter where they are in the document|
|//div/h1[@class='title']        |Selects all h1 tags that have "title" as class and that are children of a div tag|
|//div[contains(@class, 'title')]        |Selects all div tags that have a class that contains 'title', thus it can be 'main-title', 'title ' etc.|
|//*  |Selects all nodes|
|//*[@class='result']        |Selects all nodes with class "result"|
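|//div[1]        |Selects every div tag that is the first div child of its parent, no matter where it is in the document (use `(//div)[1]` to select only the first div tag of the whole document)|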
|//a/@href        |Returns the href attribute of all "a" tags|
|//h1/text()       |Returns the text of all h1 tags|

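To see a few of these expressions in action, here is a minimal sketch that runs them with lxml on a tiny hand-written page. The markup, the `highlighted` class and the `/r/1`, `/r/2` links are invented for the example.

```python
from lxml import html

page = html.document_fromstring("""
<html><body>
    <div>
        <div class="result"><h1 class="title"><a href="/r/1">My Result 1</a></h1></div>
        <div class="result highlighted"><h1 class="title"><a href="/r/2">My Result 2</a></h1></div>
    </div>
</body></html>
""")

exact = page.xpath("//div[@class='result']")               # class is exactly "result"
partial = page.xpath("//div[contains(@class, 'result')]")  # class contains "result"
hrefs = page.xpath("//a/@href")                            # attribute values
titles = page.xpath("//h1//text()")                        # text nested inside h1 tags
first_children = page.xpath("//div[1]")                    # first div child of each parent
first_overall = page.xpath("(//div)[1]")                   # first div of the whole document

print(len(exact), len(partial), len(first_children), len(first_overall))
print(hrefs, titles)
```

The exact match misses the second result because its class attribute is `result highlighted`, which is why `contains` is often the safer choice for class names.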

### Small Example

Let's say our page is as below
```html
<body>
    <div>
        <div class="result">
            <h1 class="title"><a href="link/to/my/resultpage1">My Result 1</a></h1>
            <p class="info">My Info 1</p>
        </div>
    </div>
</body>
```
And we want to retrieve three pieces of information

<ul>
    <li>The title (My Result 1)</li>
    <li>The link (link/to/my/resultpage1)</li>
    <li>The information (My Info 1)</li>
</ul>

- `//div` will return two elements,

    - the first one being:

    ```html
    <div>
        <div class="result">
            <h1 class="title"><a href="link/to/my/resultpage1">My Result 1</a></h1>
            <p class="info">My Info 1</p>
        </div>
    </div>
    ```

    - the second one being:

    ```html
    <div class="result">
        <h1 class="title"><a href="link/to/my/resultpage1">My Result 1</a></h1>
        <p class="info">My Info 1</p>
    </div>
    ```

- `//div[@class='result']` will return:
    ```html
    <div class="result">
        <h1 class="title"><a href="link/to/my/resultpage1">My Result 1</a></h1>
        <p class="info">My Info 1</p>
    </div>
    ```
    
That's what we need! Now that we have our result `div`, we have to look for the title. We have to start from the current "node", so our path will start with a dot (`.`), thus
`./h1[@class='title']` will return 
```html
<h1 class="title"><a href="link/to/my/resultpage1">My Result 1</a></h1>
```
Hence, the title can be retrieved with `./h1[@class='title']//text()` (the double slash is needed because the text sits inside the nested `a` tag)

and the link with `./h1[@class='title']/a/@href` (Note that it can also be retrieved with `.//a/@href`)

In the same way, the information can be retrieved with `./p[@class='info']/text()`
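
The whole small example can be sketched with lxml as follows. The page is the snippet above wrapped in `html` tags so that it forms a full document; since the title text sits inside the nested `a` tag, the sketch uses `//text()` for the title.

```python
from lxml import html

page = html.document_fromstring("""
<html><body>
    <div>
        <div class="result">
            <h1 class="title"><a href="link/to/my/resultpage1">My Result 1</a></h1>
            <p class="info">My Info 1</p>
        </div>
    </div>
</body></html>
""")

# Select the result block, then search relative to it (paths start with ".").
result = page.xpath("//div[@class='result']")[0]
title = result.xpath("./h1[@class='title']//text()")
link = result.xpath("./h1[@class='title']/a/@href")
info = result.xpath("./p[@class='info']/text()")

print(title, link, info)
```

Note that `xpath` always returns a list, even when there is a single match.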

# Real-life example

Let's scrape the reviews from the Trustpilot page of Trip Mate.

In [None]:
# Write the url you want to scrape
url = "https://www.trustpilot.com/review/tripmate.com"

# 'Request' the HTML of the page
http_request = requests.get(url)

# Retrieve its content
page_content = http_request.content

# Display the HTML to see what it looks like
print(page_content)

As you can see, it is very difficult to read, let alone to find the information you are looking for. In fact, in addition to the text you see in your browser, it contains a lot of other information, such as the formatting (bold, italic, font size, color etc.).<br>
Let's pass it to the package we imported.

In [None]:
# Transform the HTML content to the right format
page_html = html.fromstring(page_content)
print(page_html)

The package offers functions to get information by class name, id etc. but we'll use only the <strong>xpath</strong> method as it is the most generic one.<br>

The first thing we have to do is to identify where the information we want to scrape is located. There is a very easy way to do so:
- Open your browser and go to the Trustpilot page of Trip Mate https://www.trustpilot.com/review/tripmate.com
- Highlight the information you want to scrape (or some information near it) and right-click on it.<br>
    <img src="https://i.ibb.co/0XgBzw1/Screenshot-2020-11-06-at-11-35-02-PM.png" alt="Screenshot-2020-11-06-at-11-35-02-PM" border="0" width="50%">
- Click on "inspect" and you should see a pane on the right of the screen with some HTML code. If you don't see HTML code, go to the "Elements" tab.<br>
<img src="https://i.ibb.co/d4ZhjbJ/Screenshot-2020-11-06-at-11-31-43-PM.png" alt="Screenshot-2020-11-06-at-11-31-43-PM" border="0" width="90%">
- On this pane you can see the HTML code of the page. If you have correctly highlighted the information you wanted, a piece of HTML code should be highlighted too. Often, as the HTML code is highly nested, you don't have all the information you wanted, so you'll have to look at the "context", meaning what's around and inside the highlighted component.
- Here is a sample of what you should see (simplified):<br>
    ```html
    <section class="review__content">
        <div class="review-content">
            <div class="review-content__header" v-pre="">...</div>
            <div class="review-content__body" v-pre="">
                <h2 class="review-content__title">
                        <a class="link link--large link--dark" href="/reviews/5f93ee04798e6f04a41c33fa" data-track-link="{'target': 'Single review', 'name': 'review-title'}">Everything arrived as ordered</a>
                </h2>
            </div>
        </div>
    </section>
    ```

- Whenever you hover on a tag, you'll see the corresponding component on the page being highlighted. This will help you understand how the website has been structured. By highlighting the tag `<div class="review-card  ">` you can see that the whole review is highlighted. By hovering on other tags you will notice that either the tag is within this div and the highlighted part is within the first review, or the tag is outside this div and other parts are highlighted. It means that the block `<div class="review-card  ">` contains all the information of the first review.<br>
    <img src="https://i.ibb.co/yWJq83C/Screenshot-2020-11-06-at-11-46-04-PM.png" alt="Screenshot-2020-11-06-at-11-46-04-PM" border="0" width="90%">
- In the meantime, if you click on the arrow next to the `<div class="review-card  ">`, the block will shrink, hiding everything inside. However, you can also notice that all the following "div" tags are very similar to this one, except a few of them...<br>
<img src="https://i.ibb.co/x2BZ9PN/Screenshot-2020-11-06-at-11-59-09-PM.png" alt="Screenshot-2020-11-06-at-11-59-09-PM" border="0" width="50%"> <br>*Try to hover on each one and understand what they correspond to.*

- **Tip:** Whenever you see that a class ends with a space, it means that the tag may have multiple classes separated by spaces. For instance, here, `<div class="review-card  ">` ends with a space. And if you look at the other div tags you'll find `<div class="review-card  review-card--has-stack">`. It means that this div is a 'review-card' but has something special. In that case, if you want to get all the review cards, you'll have to use the 'contains' keyword in your path (see the Path table above). In some cases, you won't see any div with multiple classes; still use the 'contains' keyword, as they might show up on other pages.
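
One caveat worth knowing: `contains` matches any substring, so it can also pick up tags whose class merely starts with the same word. The sketch below compares it with a stricter, commonly used XPath idiom that matches a whole class name. The markup is hand-written, and the `review-card__footer` class is hypothetical, not taken from the real page.

```python
from lxml import html

# Hand-written markup: two review cards and one unrelated block.
page = html.document_fromstring("""
<html><body>
    <div class="review-card  ">first review</div>
    <div class="review-card  review-card--has-stack">second review</div>
    <div class="review-card__footer">not a review</div>
</body></html>
""")

# Substring match: also catches "review-card__footer".
loose = page.xpath("//div[contains(@class, 'review-card')]")

# Whole-word match: pad the class list with spaces and look for " review-card ".
strict = page.xpath(
    "//div[contains(concat(' ', normalize-space(@class), ' '), ' review-card ')]"
)

print(len(loose), len(strict))
```

If the count you get on a real page is higher than the number of reviews you can see, a substring match of this kind is one possible cause to check.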
    

## Let's scrape all the results

Write a path to select all the div tags that contain a result. Compare the number of results you get with the number of reviews on the webpage.

Please remember that the computer doesn't make mistakes: if it doesn't give you what you want, it means that you didn't ask for what you wanted.

In [None]:
# Write here the path to div tags that contain the results.
# Check the tables above if you don't remember the syntax
xpath_results = ""

# The package lxml will find the objects that correspond to what you asked for.
results = page_html.xpath(xpath_results)

# Print the number of results and check that it is correct by counting them on the webpage.
n_results = len(results)
print(f"{n_results} results were found")

#### Let's print the results to see what it looks like

In [None]:
pprint(results)

Doesn't tell us much...except that we have "div" elements, which is what we asked for.

Now let's focus on only one result, and try to get the information we need.



### Start by scraping one review

In [None]:
# first_res will contain only the first result
first_res = results[0]

# Let's see the text it contains
pprint(first_res.xpath(".//text()"))

Now we have the div element of the first result. As we saw above, we should have all the information about the first review contained in this "div" element.

Let's find the xpath to the title of the review and to the content of the review.

**Hints:**
- Remember that we start from the "current node" so our paths should start with "." (otherwise it will look for elements anywhere in the document, even outside the div element).
- To retrieve some text, don't forget to add '//text()' at the end of the xpath, otherwise you will get a node.

In [None]:
# Write here the path to the title.
xpath_title = "Fill in here the xpath to the title"

# The package lxml will find the objects that correspond to what you asked for.
# The function will always return a list, either empty or containing the matching objects
title = first_res.xpath(xpath_title)

# Print the title
print(f"The title of the review is: '{title}'")

Let's clean the title.

In [None]:
# join the list to have a string
cleaned_title = "".join(title)
print(cleaned_title)

In [None]:
# remove '\n' and useless spaces
cleaned_title = cleaned_title.strip()
print(cleaned_title)

We will need to clean other information, so let's put the cleaning process into a function

In [None]:
def clean_text(text):
  """
  Function to clean a text.
  Takes a list of strings (or a single string) and returns one cleaned string
  """
  # join the list to have a string
  cleaned_text = "".join(text)
  # remove '\n' and useless spaces
  cleaned_text = cleaned_text.strip()
  return cleaned_text

In [None]:
# Write here the path to the content.
xpath_content = "Fill in here the xpath to the content"

# The package lxml will find the objects that correspond to what you asked for.
# Write a piece of code to get the content
content = first_res.xpath(xpath_content)

# Clean the content
cleaned_content = clean_text(content)

# Print the content
print(f"The content of the review is: '{cleaned_content}'")

In [None]:
# Write here the path to the rating.
xpath_rating = "Fill in here the xpath to the rating"

# The package lxml will find the objects that correspond to what you asked for.
# Write a piece of code to get the rating (see above if you need help)
rating = "Write a piece of code to get the rating"

# Clean the rating
# Write a piece of code to clean the rating
cleaned_rating = clean_text(rating)

# Print the content
print(f"The rating of the review is: '{cleaned_rating}'")

In [None]:
# Write here the path to the date.
xpath_date = "Fill in here the xpath to the date"

# The package lxml will find the objects that correspond to what you asked for.
# Write a piece of code to get the date (see above if you need help)
info_dates = "Write a piece of code to get the date"

# Clean the date
cleaned_info_dates = clean_text(info_dates)

# Print the content
print(f"The info on the date of the review is: '{cleaned_info_dates}'")

Hmmm Not exactly what we expected... We have to further clean it!
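
To picture what this cleaning involves, here is a minimal offline sketch on a made-up string. The exact shape of the text (a small JSON object with a `publishedDate` key) is an assumption based on the steps that follow, and the date value is invented.

```python
import json

# Made-up example of the kind of text the date script may contain.
raw = '{"publishedDate":"2020-10-24T09:05:08Z","updatedDate":null,"reportedDate":null}'

# Approach 1: slice by position.
# 'publishedDate' is 13 characters long and is followed by '":"' (3 more),
# so the date starts 16 characters after the start of 'publishedDate'.
start = raw.find("publishedDate") + 16
date_by_slice = raw[start:start + 10]

# Approach 2: parse the JSON, read the key, keep the first 10 characters.
date_by_json = json.loads(raw)["publishedDate"][:10]

print(date_by_slice, date_by_json)
```

Slicing by position is easy to follow but silently returns garbage when `find` does not locate the key (it returns -1), whereas parsing the JSON fails loudly.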

In [None]:
# The date information starts 16 characters after the start of 'publishedDate'
# Thus, let's find where the 'publishedDate' string starts
date_index = cleaned_info_dates.find("publishedDate")
print(date_index)

In [None]:
# The date information starts 16 characters after the start of 'publishedDate'
date_start_index = date_index + 16
# The date information has 10 characters
date_end_index = date_start_index + 10
# We know where the date starts and when it ends, let's select it
date = cleaned_info_dates[date_start_index:date_end_index]
print(date)

In [None]:
# # Other (nicer) way to do it

# # Clean the date
# import json
# # In the general case, avoid putting imports in the middle of the code
# # All imports must be at the top of the file
# # However, this is a training file, so it's ok

# info_date_dict = json.loads(cleaned_info_dates)
# print(f"The info cleaned on the date of the review are: '{info_date_dict}'")

# print(f"Type of cleaned_info_dates: {type(cleaned_info_dates)}")
# print(f"Type of info_date_dict: {type(info_date_dict)}")

# date = info_date_dict["publishedDate"]
# print(f"The date of the review is: '{date}'")

In [None]:
# Let's see everything

print(f"The title of the review is: '{cleaned_title}'")
print(f"The content of the review is: '{cleaned_content}'")
print(f"The rating of the review is: '{cleaned_rating}'")
print(f"The date of the review is: '{date}'")

Very nice! You just scraped the information from the first review. All that remains is to do the same for all the other reviews.

## Scrape all the reviews from the page

First, let's put everything into a function.
Let's write a function that:
<ul>
    <li>takes as parameter a "div" element containing the review</li>
    <li> and returns a dictionary with the keys ["title", "content", "rating", "date"] and their cleaned values.</li>
</ul>

In [None]:
def parse_review(review_block):
  """
  Parse a review.
  Takes an HTML element containing the review and returns a dictionary with cleaned information
  """
  # Create a dictionary to store the results
  info = dict()
  # Write here the path to the title.
  xpath_title = ".//h2//text()"
  # Retrieve the title
  title = review_block.xpath(xpath_title)
  # Clean the title
  cleaned_title = clean_text(title)
  # Store the title
  info["title"] = cleaned_title
  # Same thing with the content
  xpath_content = ".//p[@class='review-content__text']//text()"
  content = review_block.xpath(xpath_content)
  cleaned_content = clean_text(content)
  info["content"] = cleaned_content
  # Same thing with the rating
  xpath_rating = ".//img/@alt"
  rating = review_block.xpath(xpath_rating)
  cleaned_rating = clean_text(rating)
  info["rating"] = cleaned_rating
  # Same thing with the date, don't forget to clean it
  xpath_date = ".//script[@data-initial-state='review-dates']//text()"
  date = review_block.xpath(xpath_date)
  cleaned_info_dates = clean_text(date)
  date_index = cleaned_info_dates.find("publishedDate")
  date_start_index = date_index + 16
  date_end_index = date_start_index + 10
  cleaned_date = cleaned_info_dates[date_start_index:date_end_index]
  info["date"] = cleaned_date
  return info

Let's try it

In [None]:
pprint(parse_review(first_res))

Nice! Let's do the same for all the reviews we got in results

In [None]:
# Create a list to store the scraped information
all_reviews_info = []
# Explore all reviews
for review in results:
    # For each review, get the information of the review by calling the function 'parse_review' with the parameter 'review'
    review_info = parse_review(review)
    # Store them in the list all_reviews_info
    all_reviews_info.append(review_info)

pprint(all_reviews_info)

Great! We just scraped the information contained in the first page of Trip Mate reviews.

Now, let's put that in a function that
<ul>
    <li>takes as parameter the html page to scrape </li>
    <li> and returns a list of dictionaries with the keys ["title", "content", "rating", "date"] and their cleaned values.</li>
</ul>

In [None]:
def parse_page(page_html):
    # Write the xpath of the result blocks
    xpath_results = "//div[contains(@class, 'review-card')]"
    # Get all the reviews
    all_results = page_html.xpath(xpath_results)
    # Create a list to store the scraped information
    all_reviews_info = []
    # Explore all reviews
    for review in all_results:
        # For each review, get the information of the review
        review_info = parse_review(review)
        # Store them in the list all_reviews_info
        all_reviews_info.append(review_info)
    return all_reviews_info

Let's try it

In [None]:
all_reviews = parse_page(page_html)
pprint(all_reviews)

Awesome! You have a function that gets an html page and returns all the reviews from the page. What's left now is to apply this function on all the available pages.

### Scrape all the reviews available on the website

To scrape all the pages, we have two choices:
1. Generate each url and scrape it if the page exists. It is possible to do so, as the urls follow a similar pattern:
    - Page 1: https://www.trustpilot.com/review/tripmate.com
    - Page 2: https://www.trustpilot.com/review/tripmate.com?page=2
    - Page 3: https://www.trustpilot.com/review/tripmate.com?page=3
    - etc.<br>
Therefore, you could generate each url and try to scrape it. Although it may be useful in some cases (especially for parallelization), we won't use it here.
2. On each page, you have a "next" button. Hence, we are going to get the link to the next page from there. If there is no link, then we stop the process.
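
Although the notebook follows the second option, the first one can be sketched in a few lines. `build_page_url` is a hypothetical helper written for this illustration; it only reproduces the url pattern listed above and does not check that the page exists.

```python
from urllib.parse import urlencode

def build_page_url(base_url, page_number):
    """Hypothetical helper: return the url of a given page of reviews."""
    if page_number == 1:
        # The first page has no 'page' parameter in the pattern above.
        return base_url
    return f"{base_url}?{urlencode({'page': page_number})}"

base_url = "https://www.trustpilot.com/review/tripmate.com"
page_urls = [build_page_url(base_url, n) for n in range(1, 4)]

for page_url in page_urls:
    print(page_url)
```

With this approach you still need a stopping rule, for instance stopping at the first page that returns no reviews.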

In [None]:
# Write here the path to the next page.
xpath_next_link = "Fill in here the xpath to the link of the next page"

# Retrieve the link to the next page
res_next_link = page_html.xpath(xpath_next_link)
res_next_link_cleaned = clean_text(res_next_link)
# Print the result of the next link
print(res_next_link_cleaned)

The link seems weird. In fact, it's a relative link and not an absolute link, meaning it has to be resolved against the url of the current page. No worries, Python's standard library provides a function to build the absolute link easily from the url of the page.
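
As an offline illustration of how such a resolution behaves, here is a small sketch using `urljoin` from the standard library. The links passed to it are written by hand for the example, not scraped.

```python
from urllib.parse import urljoin

page_url = "https://www.trustpilot.com/review/tripmate.com"

# A link starting with "/" is resolved against the root of the site.
from_relative = urljoin(page_url, "/review/tripmate.com?page=2")

# A link that is already absolute is returned unchanged.
from_absolute = urljoin(page_url, "https://www.trustpilot.com/review/tripmate.com?page=3")

print(from_relative)
print(from_absolute)
```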

In [None]:
from urllib.parse import urljoin
# In the general case, avoid putting imports in the middle of the code
# All imports must be at the top of the file
# However, this is a training file, so that's ok

next_link_absolute = urljoin(url, res_next_link_cleaned)
print(next_link_absolute)

Nice! Let's put that in a function that takes the url of a page and the html page and returns the next link. It will return None if it's the last page.

In [None]:
def get_next_link(url, page_html):
    # Write here the path to the next page.
    xpath_next_link = "//a[@data-page-number='next-page']/@href"
    # Retrieve the link to the next page
    res_next_link = page_html.xpath(xpath_next_link)
    
    # Check whether or not there is a link
    if len(res_next_link) > 0: # (i.e if the list is not empty)
        res_next_link_cleaned = clean_text(res_next_link) # Then clean the result
        next_link = urljoin(url, res_next_link_cleaned) # Get the absolute link
    else:
        next_link = None
    return next_link

Let's try it!

In [None]:
get_next_link(url, page_html)

Nice! We have everything now!

Let's recap:

For any Trustpilot url

<ul>
    <li>While we have a url to scrape (link is not None):</li>
    <ol>
        <li>Get the HTML of the page in the right format</li>
        <li>Scrape the reviews of the page and store the information</li>
        <li>Get the next link if possible, otherwise we have scraped the reviews of the last page, and the link is None</li>
    </ol>
    <li>Return the results</li>
</ul>
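
The loop described in this recap can be tried offline first. The sketch below replaces the website with a made-up dictionary of pages, so all names in it (`FAKE_PAGES`, `crawl`, `max_pages`) are hypothetical. It also adds two safety nets that are assumptions of this sketch: a set of visited urls and a maximum number of pages, so that a site whose "next" link loops back cannot keep the loop running forever.

```python
# Made-up site: each "page" lists its reviews and the url of the next page
# (None on the last page).
FAKE_PAGES = {
    "page-1": {"reviews": ["review A", "review B"], "next": "page-2"},
    "page-2": {"reviews": ["review C"], "next": "page-3"},
    "page-3": {"reviews": ["review D"], "next": None},
}

def crawl(start_url, max_pages=100):
    """Follow the 'next' links from start_url and collect every review."""
    collected = []
    visited = set()
    next_url = start_url
    while next_url is not None and next_url not in visited and len(visited) < max_pages:
        visited.add(next_url)
        page = FAKE_PAGES[next_url]    # stands in for requesting and parsing the HTML
        collected += page["reviews"]   # stands in for scraping the reviews of the page
        next_url = page["next"]        # stands in for getting the next link
    return collected

all_found = crawl("page-1")
first_two_pages = crawl("page-1", max_pages=2)
print(all_found)
print(first_two_pages)
```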

In [None]:
def scrap_all_reviews(url):
    # Initialize 'next_url' that will be modified
    # It's better to not alter the url parameter
    next_url = url
    # Create a list to store the results
    all_reviews = []
    # Explore all the urls
    while next_url is not None:
        # 'Request' the HTML
        http_request = requests.get(next_url)
        # Retrieve its content
        page_content = http_request.content
        # Transform the HTML content to the right format
        page_html = html.fromstring(page_content)
        # Scrape the reviews of the page
        page_reviews = parse_page(page_html)
        # Store the scraped reviews
        all_reviews += page_reviews
        # Display a message to show completion
        print(f"Done with {next_url}")
        # Get the url of the next page
        next_url = get_next_link(next_url, page_html)
    return all_reviews

In [None]:
url = "https://www.trustpilot.com/review/tripmate.com"
all_reviews = scrap_all_reviews(url)
print(f"Scraped {len(all_reviews)} reviews")

Check that the total number of reviews scraped matches the total number of reviews mentioned on the website. If it's not the case, try to investigate why. For instance, go to the last page scraped and see if there are other reviews available in other languages but not displayed etc.

Awesome!

## Customer Reviews Analysis

In [None]:
# Package to handle the data
import pandas as pd

# Packages to display graphs
import seaborn as sns
import matplotlib.pyplot as plt

sns.set(rc={'figure.figsize':(10, 5)})
# In the general case, avoid putting imports in the middle of the code
# All imports must be at the top of the file
# However, this is a training file, so that's ok

In [None]:
# Create a DataFrame (= basically a table)
df = pd.DataFrame(all_reviews)
# Display first 10 rows
df.head(10)

### Let's save our results!

In [None]:
from google.colab import drive
# Authenticate to tell Google Drive that you are in fact the owner of this Drive
drive.mount('drive')

In [None]:
# Give the path to the file where you want the reviews to be stored.
# The Folder should already exist, go create it if it's not the case.
filepath = "drive/My Drive/Training - Scraping/customer_reviews_TripMate.xlsx"
# Save results in that file
df.to_excel(filepath)

### Let's see how many reviews per rating the company got

In [None]:
df_rating = df.groupby("rating")["content"].count()
df_rating

In [None]:
df_rating = df_rating.reset_index()
df_rating

In [None]:
sns.barplot(x="rating", y="content", data=df_rating)
plt.show()

### Let's see how many reviews per month the company got
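
Counting per month means mapping every date to its month first. Here is a minimal sketch of that step on made-up dates, assuming they come as "YYYY-MM-DD" strings like the scraped ones; it uses the vectorized `.str` accessor instead of `apply`.

```python
import pandas as pd

# Made-up dates in the same "YYYY-MM-DD" shape as the scraped ones.
sample = pd.DataFrame({"date": ["2020-01-15", "2020-01-30", "2020-03-02"]})

# Keep "YYYY-MM" and let pandas turn it into the first day of that month.
sample["date_year_month"] = pd.to_datetime(sample["date"].str[:7], format="%Y-%m")

# Count the rows falling in each month.
per_month = sample.groupby("date_year_month")["date"].count()
print(per_month)
```

Note that February does not appear at all in the result: a month without reviews is simply missing, not counted as 0.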

In [None]:
# First transform the date into a readable format

def get_year_month(date):
  """
  Function to get the year and the month from a date.
  Takes a string starting with 'YYYY-MM' and returns a pandas Timestamp set to the first day of that month.
  """
  return pd.to_datetime(date[:7]) # Get the first 7 characters (year and month) and transform it into a datetime object

df["date_year_month"] = df["date"].apply(get_year_month)
df.head()

In [None]:
# Plot the number of reviews per month

df_year_month = df.groupby("date_year_month")["content"].count().reset_index()
sns.lineplot(x="date_year_month", y="content", data=df_year_month)
plt.show()

In [None]:
df_year_month_rating = df.groupby(["date_year_month", "rating"])["content"].count().reset_index()

sns.lineplot(x="date_year_month", y="content", data=df_year_month_rating,
             hue="rating")
plt.show()

In [None]:
start_date = pd.to_datetime("2020-01-01")
end_date = pd.to_datetime("2020-10-01")

def is_within_select_period(date):
  return date >= start_date and date < end_date

df_select_period = df_year_month_rating[df_year_month_rating["date_year_month"].apply(is_within_select_period)]
df_select_period

In [None]:
sns.lineplot(x="date_year_month", y="content", data=df_select_period,
             hue="rating")
plt.show()

In [None]:
df_year_month_rating = df_select_period.set_index(["date_year_month", "rating"]).unstack(
                                fill_value=0
                            ).asfreq(
                                'MS', fill_value=0
                            ).stack().sort_index(level=0).reset_index()

df_year_month_rating
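
The reshaping chain above appears intended to add explicit rows with a count of 0 for the months and ratings that have no reviews, so that the lines in the plot drop to zero instead of skipping those months. The same idea can be sketched more explicitly with `pivot`, `reindex` and `melt`. The data below is made up, and this is an alternative formulation rather than the notebook's own code.

```python
import pandas as pd

# Made-up monthly counts: nothing in February, and one rating per month.
counts = pd.DataFrame({
    "date_year_month": pd.to_datetime(["2020-01-01", "2020-03-01"]),
    "rating": ["5 stars", "1 star"],
    "content": [3, 1],
})

# One row per month, one column per rating.
wide = counts.pivot(index="date_year_month", columns="rating", values="content")

# Add the missing months, then replace every gap with 0.
all_months = pd.date_range("2020-01-01", "2020-03-01", freq="MS")
wide = wide.reindex(all_months).fillna(0).astype(int)

# Back to one row per (month, rating) pair, ready for plotting.
filled = (
    wide.rename_axis("date_year_month")
        .reset_index()
        .melt(id_vars="date_year_month", var_name="rating", value_name="content")
)
print(filled)
```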

In [None]:
sns.lineplot(x="date_year_month", y="content", data=df_year_month_rating,
            hue="rating")
plt.show()

In [None]:
# Package to create wordclouds
from wordcloud import WordCloud

In [None]:
all_text_reviews = " ".join(df["content"])

wordcloud = WordCloud(width=400, height=600, background_color="white").generate(all_text_reviews)

plt.figure( figsize=(10,10) )
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()