<strong>
    <font color="#088A68">
        Author: lprtk
    </font>
</strong>

<br/>
<br/>


<Center>
    <h1 style="font-family: Arial">
        <font color="#084B8A">
            NLP: sentiment analysis, topic modeling & sentiment prediction
        </font>
    </h1>
    <h3 style="font-family: Arial">
        <font color="#088A68">
            Notebook 1/5
        </font>
    </h3>
</Center>

<br/>

<h3 style="font-family: Arial"><font color="#088A68">Paris 1 Panthéon-Sorbonne, M2 MoSEF 2021-2022</font></h3>

-------------------------------------------------------------------------------------------------------------------------------

<div style="margin: 10px;">
    <h2 style="font-family: Arial">
        <font color="#084B8A">
            Introduction & context
        </font>
    </h2>
</div>

<p style="text-align: justify">
    This project focuses on extracting information and value from large volumes of textual data using Natural Language Processing (NLP). Why would a company want to do this?
</p>
<ul>
    <li><p style="text-align: justify">To improve the customer experience on the website, in the mobile application or in the office.</p></li>
    <li><p style="text-align: justify">To assess customer satisfaction differently.</p></li>
    <li><p style="text-align: justify">To evaluate the company's image.</p></li>
    <li><p style="text-align: justify">To be more available and accessible to customers.</p></li>
    <li><p style="text-align: justify">Depending on the company's activity: to find new solutions to improve the banking services offered, to evaluate sellers on an online sales platform or to improve a product based on customer reviews.</p></li>
</ul>

<p style="text-align: justify">
    Our approach is organized into 5 main steps:
</p>
<ul>
    <li>
        <u>Step 1:</u> Web Scraping
        <ul>
            <li>Collect the data and create the data schema.</li>
            <li>Parse customer reviews to enrich the database: extract the title, description, date, time, nickname and rating.</li>
        </ul>
    </li>
</ul>
<ul>
    <li>
        <u>Step 2:</u> Sentiment Analysis and Scoring
        <ul>
            <li>Understand and gauge the satisfaction of each customer.</li>
            <li>Score the intensity and polarity of the sentiment expressed in the review description.</li>
        </ul>
    </li>
</ul>
<ul>
    <li>
        <u>Step 3:</u> Text mining and data cleaning
        <ul>
            <li>Text cleaning adapted to the sales domain and to the general content of reviews.</li>
        </ul>
    </li>
</ul>
<ul>
    <li>
        <u>Step 4:</u> Topic Modeling (unsupervised learning)
        <ul>
            <li>To improve availability and speed up response time, reviews can be separated and prioritized according to the topic they address.</li>
        </ul>
    </li>
</ul>
<ul>
    <li>
        <u>Step 5:</u> Machine Learning (supervised learning)
        <ul>
            <li>Design a robust model that identifies the overall sentiment expressed by the customer in future reviews, without having to read them.</li>
        </ul>
    </li>
</ul>

-------------------------------------------------------------------------------------------------------------------------------

<div style="margin: 10px;">
    <h2 style="font-family: Arial">
        <font color="#084B8A">
            Library imports
        </font>
    </h2>
</div>

In [1]:
from bs4 import BeautifulSoup
import pandas as pd
import requests
import warnings
warnings.filterwarnings("ignore")

-------------------------------------------------------------------------------------------------------------------------------

<div style="margin: 10px;">
    <h2 style="font-family: Arial">
        <font color="#084B8A">
            Web scraping
        </font>
    </h2>
</div>

<div style="margin: 10px;">
    <h3 style="font-family: Arial">
        <font color="#088A68">
            1) Function initialization
        </font>
    </h3>
</div>
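
<p style="text-align: justify">
    The functions defined in this section locate each review field by its HTML tag and class. As a minimal offline sketch of that lookup, the snippet below runs <code>find_all</code> on a small hand-written page. The markup in <code>SAMPLE_HTML</code> is hypothetical and only reuses class names that appear in this notebook; the real page structure may differ.
</p>

```python
from bs4 import BeautifulSoup

# Hypothetical, hand-written markup standing in for a review page.
SAMPLE_HTML = """
<div>
  <span class="a-profile-name">alice</span>
  <span class="a-size-base review-text review-text-content"> Great laptop </span>
  <span class="a-profile-name">bob</span>
  <span class="other">ignored</span>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# A single class name matches every element that carries that class.
names = [
    tag.get_text()
    for tag in soup.find_all("span", class_="a-profile-name")
]

# A string with several class names is compared with the exact value of the
# class attribute, so the order of the names matters.
reviews = [
    tag.get_text(strip=True)
    for tag in soup.find_all(
        "span", class_="a-size-base review-text review-text-content"
    )
]

print(names)
print(reviews)
```

<p style="text-align: justify">
    Because a multi-class string must match the attribute value exactly, a selector of this kind can stop matching if the site reorders or renames its classes.
</p>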

In [2]:
def get_data(url: str) -> str:
    """
    Send an HTTP GET request to a web page and return its content.

    Parameters
    ----------
    url : str
        Internet address: uniform resource locator (url).

    Returns
    -------
    str
        HTML body of a web page.

    """
    # `headers` is a module-level dictionary that must be defined before
    # this function is called.
    content = requests.get(url, headers=headers)
    
    return content.text


def html_code(url: str) -> BeautifulSoup:
    """
    Download a web page and parse its HTML body.

    Parameters
    ----------
    url : str
        Internet address: uniform resource locator (url).

    Returns
    -------
    bs4.BeautifulSoup
        Parsed HTML body of a web page.

    """
    html_data = get_data(url=url)
    soup = BeautifulSoup(html_data, "html.parser")

    return soup


def cus_data(soup: BeautifulSoup, html_tag: str, html_class: str) -> list:
    """
    Extract the text of every element of a parsed HTML document that
    matches a given tag and class.

    Parameters
    ----------
    soup : bs4.BeautifulSoup
        Parsed HTML body of a web page.

    html_tag : str
        HTML tag which indicates what to scrape.

    html_class : str
        HTML class which indicates what to scrape.

    Returns
    -------
    list
        Scraped text.

    """
    cus_list = []
    for item in soup.find_all(html_tag, class_=html_class):
        data_str = item.get_text()
        cus_list.append(data_str)
        
    return cus_list


def scraped_data(soup: BeautifulSoup) -> pd.core.frame.DataFrame:
    """
    Scrape each review field from a parsed page and format the scraped
    data into a pandas.core.frame.DataFrame.

    Parameters
    ----------
    soup : bs4.BeautifulSoup
        Parsed HTML body of a web page.

    Returns
    -------
    pandas.core.frame.DataFrame
        Scraped database.

    """
    pseudo = cus_data(
        soup=soup,
        html_tag="span",
        html_class="a-profile-name"
    )
    title = cus_data(
        soup=soup,
        html_tag="a",
        html_class="a-size-base a-link-normal review-title a-color-base review-title-content a-text-bold"
    )
    review = cus_data(
        soup=soup,
        html_tag="span",
        html_class="a-size-base review-text review-text-content"
    )
    rating = cus_data(
        soup=soup,
        html_tag="span",
        html_class="a-icon-alt"
    )
    verified_purchase = cus_data(
        soup=soup,
        html_tag="span",
        html_class="a-size-mini a-color-state a-text-bold"
    )
    date = cus_data(
        soup=soup,
        html_tag="span",
        html_class="a-size-base a-color-secondary review-date"
    )
    product = cus_data(
        soup=soup,
        html_tag="div",
        html_class="a-row product-title"
    )
    
    # Build one row per review title; every field is matched by position.
    columns = [
        "Pseudo", "Title", "Review", "Rating", "Verified Purchase", "Date"
    ]
    rows = [
        [
            pseudo[i],
            title[i],
            review[i],
            rating[i],
            verified_purchase[i],
            date[i]
        ]
        for i in range(len(title))
    ]
    df_data = pd.DataFrame(rows, columns=columns)
    
    return df_data


def scraped_data_multipage(url: str) -> pd.core.frame.DataFrame:
    """
    Scrape the first five review pages of a product and format the
    scraped data into a pandas.core.frame.DataFrame.

    Parameters
    ----------
    url : str
        Internet address: uniform resource locator (url).

    Returns
    -------
    pandas.core.frame.DataFrame
        Scraped database.

    """
    # Scrape review pages 1 to 5 of the product.
    pages = []
    for page_number in range(1, 6):
        soup = html_code(url=f"{url}&pageNumber={page_number}")
        pages.append(scraped_data(soup=soup))
    
    df_data = pd.concat(pages).reset_index(drop=True)
    
    return df_data

In [3]:
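headers = {
    # Adjacent string literals are concatenated, so the User-Agent holds
    # single spaces only. A backslash continuation inside the string would
    # also keep the indentation of the following line.
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/90.0.4430.212 Safari/537.36"
    ),
    "Accept-Language": "en-US, en;q=0.5"
}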
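
<p style="text-align: justify">
    <code>scraped_data</code> matches the field lists by position, so a page where one list is shorter than the list of titles (for example, a review without a "Verified Purchase" badge) would raise an <code>IndexError</code>. One possible guard, sketched below with the standard library only, pads short lists with an empty string. The helper name <code>align_fields</code> and the sample values are hypothetical and not part of the notebook.
</p>

```python
from itertools import zip_longest


def align_fields(fields: dict, fill: str = "") -> list:
    """Turn {column: [values...]} into a list of row dictionaries.

    Columns shorter than the longest one are padded at the end with `fill`.
    """
    columns = list(fields)
    return [
        dict(zip(columns, values))
        for values in zip_longest(*fields.values(), fillvalue=fill)
    ]


# Hypothetical scraped lists: two titles but only one badge.
fields = {
    "Title": ["Good", "Bad"],
    "Verified Purchase": ["Verified Purchase"],
}
rows = align_fields(fields)
print(rows)
```

<p style="text-align: justify">
    Padding only prevents the crash: it cannot tell which review the missing value belonged to, so values may still be attached to the wrong row. Scraping each review container separately would likely be more robust, at the cost of a different page traversal.
</p>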

<div style="margin: 10px;">
    <h3 style="font-family: Arial">
        <font color="#088A68">
            2) Function application
        </font>
    </h3>
</div>
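
<p style="text-align: justify">
    <code>scraped_data_multipage</code> reaches each review page by appending <code>&amp;pageNumber=N</code> to the base URL, which assumes that the base URL already contains a query string. The sketch below builds the page URLs and checks the resulting parameter offline. The helper name <code>page_urls</code> and the example URL are hypothetical.
</p>

```python
from urllib.parse import parse_qs, urlsplit


def page_urls(base_url: str, n_pages: int = 5) -> list:
    """Return base_url with '&pageNumber=k' appended for k = 1..n_pages."""
    return [f"{base_url}&pageNumber={k}" for k in range(1, n_pages + 1)]


# Hypothetical URL with the same query-string shape as the URLs scraped here.
base = (
    "https://www.example.com/product-reviews/ABC123"
    "?ie=UTF8&reviewerType=all_reviews"
)
urls = page_urls(base)

# Parse the third URL back to confirm that the parameter was added correctly.
query = parse_qs(urlsplit(urls[2]).query)
print(len(urls), query["pageNumber"])
```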

In [4]:
# list of urls to scrape
list_urls = [
    "https://www.amazon.com/Razer-Blade-14-Gaming-Laptop/product-reviews/B094658SMY/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews",
    "https://www.amazon.com/ASUS-Processor-NumberPad-Microsoft-L210MA-DB01/product-reviews/B081V6W99V/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews",
    "https://www.amazon.com/HP-Generation-i5-1135G7-Graphics-15-dy2024nr/product-reviews/B09FXFDGN3/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews",
    "https://www.amazon.com/ASUS-IPS-Type-i5-10300H-Keyboard-FX706LI-ES53/product-reviews/B08ZLC661T/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews",
    "https://www.amazon.com/Acer-i7-1165G7-Graphics-Antimicrobial-SF514-55TA-74EC/product-reviews/B08JQKMFFB/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews",
    "https://www.amazon.com/HP-Portable-Micro-Edge-Anti-Glare-14-fq1025nr/product-reviews/B09G8SK2KK/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews",
    "https://www.amazon.com/ASUS-VivoBook-R7-3700U-Fingerprint-F512DA-NH77/product-reviews/B085344M9Q/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews",
    "https://www.amazon.com/Dell-Inspiron-3000-Laptop-Celeron/product-reviews/B09F626YKW/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews",
    "https://www.amazon.com/Acer-AN515-55-53E5-i5-10300H-GeForce-Keyboard/product-reviews/B092YHJGMN/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews",
    "https://www.amazon.com/Lenovo-Chromebook-11-6-Inch-Processor-82HG0006US/product-reviews/B08T6N424Z/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews",
    "https://www.amazon.com/Apple-32GB-Space-Model-Refurbished/product-reviews/B074PWW6NS/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews",
    "https://www.amazon.com/ASUS-Chromebook-Spill-resistant-Transparent-CX1100CNA-AS42/product-reviews/B08XTB1NNH/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews",
    "https://www.amazon.com/Acer-A515-56-36UT-Display-i3-1115G4-Processor/product-reviews/B08VKT45K4/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews",
    "https://www.amazon.com/ASUS-Processor-NumberPad-Microsoft-L210MA-DB01/product-reviews/B081V6W99V/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews",
    "https://www.amazon.com/HP-Pavilion-Generation-i7-1165G7-15-eg0025nr/product-reviews/B09FX1YF28/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews",
    "https://www.amazon.com/Samsung-5100mAh-Battery-SM-T290-International/product-reviews/B07XJZ7VQD/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews",
    "https://www.amazon.com/HP-14-Laptop-Dual-Core-Processor/product-reviews/B09VRX9YVW/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews",
    "https://www.amazon.com/MSI-Stealth-15M-Gaming-Laptop/product-reviews/B091GGZT1S/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews",
    "https://www.amazon.com/Apple-MacBook-Retina-MPTR2LL-Renewed/product-reviews/B07JMLMVKP/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews",
    "https://www.amazon.com/Lenovo-Processor-Graphics-82HU00JWUS-Graphite/product-reviews/B09BG96KFJ/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews",
    "https://www.amazon.com/Samsung-Chromebook-XE500C13-K04US-Certified-Refurbished/product-reviews/B0759YSF4W/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews",
    "https://www.amazon.com/Samsung-Chromebook-Celeron-Processor-Gigabit/product-reviews/B07XQQTVS3/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews",
    "https://www.amazon.com/ASUS-Chromebook-Touchscreen-Processor-C433TA-AS384T/product-reviews/B08ZLF99VD/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews",
    "https://www.amazon.com/Acer-Chromebook-Celeron-Display-CB314-1H-C884/product-reviews/B0858N8CGX/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews"
]
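
<p style="text-align: justify">
    The list above contains the ASUS L210MA-DB01 URL twice, so its reviews are scraped twice. Whether this repetition is intended is not stated. If it is not, one way to drop repeated URLs while keeping their order is sketched below; the helper name <code>unique_in_order</code> and the example URLs are hypothetical.
</p>

```python
def unique_in_order(items: list) -> list:
    """Remove repeated items while keeping the first occurrence of each."""
    # Dictionary keys are unique and keep their insertion order.
    return list(dict.fromkeys(items))


# Hypothetical URLs; the first one is repeated.
example_urls = [
    "https://www.example.com/a",
    "https://www.example.com/b",
    "https://www.example.com/a",
]
deduped = unique_in_order(example_urls)
print(deduped)
```

<p style="text-align: justify">
    Applying this to <code>list_urls</code> would change the number of scraped rows, so the row count reported further down would no longer match.
</p>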

In [5]:
df_data = pd.DataFrame(
    {
        "Pseudo": [""],
        "Title": [""],
        "Review": [""],
        "Rating": [""],
        "Verified Purchase": [""],
        "Date": [""]
    }
)

for url in list_urls:
    df_scraped = scraped_data_multipage(url)
    df_data = pd.concat([df_data, df_scraped], ignore_index=True)

In [6]:
df_data.drop([0], axis=0, inplace=True)
df_data.reset_index(drop=True, inplace=True)

In [7]:
df_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1136 entries, 0 to 1135
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Pseudo             1136 non-null   object
 1   Title              1136 non-null   object
 2   Review             1136 non-null   object
 3   Rating             1136 non-null   object
 4   Verified Purchase  1136 non-null   object
 5   Date               1136 non-null   object
dtypes: object(6)
memory usage: 53.4+ KB


-------------------------------------------------------------------------------------------------------------------------------

<div style="margin: 10px;">
    <h2 style="font-family: Arial">
        <font color="#084B8A">
            Data export
        </font>
    </h2>
</div>

In [8]:
df_data.to_csv(path_or_buf="amzn_customer_reviews.csv", sep=",", index=False)
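
<p style="text-align: justify">
    Review text often contains commas and line breaks, which could be a concern with a comma-separated export. The sketch below, which uses an in-memory buffer and hypothetical sample values instead of the scraped data, suggests that the same <code>to_csv</code> arguments survive a round trip through <code>read_csv</code>, because pandas quotes such fields by default.
</p>

```python
import io

import pandas as pd

# Hypothetical reviews containing a comma and a line break.
df_sample = pd.DataFrame(
    {
        "Title": ["Good, really"],
        "Review": ["Works\nwell"],
    }
)

# Write to an in-memory buffer with the same arguments as the export cell.
buffer = io.StringIO()
df_sample.to_csv(path_or_buf=buffer, sep=",", index=False)

# Read the text back and compare it with the original values.
buffer.seek(0)
df_back = pd.read_csv(buffer, sep=",")
print(df_back["Title"].tolist(), df_back["Review"].tolist())
```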