### What is Natural Language Processing? 
<p>Natural Language Processing (NLP) is the field concerned with enabling machines to understand, interpret, and manipulate human language. People express themselves in speech and in writing, and both carry a lot of information. Scaled up to a whole region or population, this adds up to a very large volume of text from which valuable insights can be drawn. Such data is usually unstructured, though: it does not come as rows and columns, and records can have arbitrary lengths. <b>Text Mining</b> is the process of converting this unstructured text into a structured format so that meaningful patterns and new insights can be identified, with the help of machine learning, statistics, and linguistics. Once the data is structured, text analytics can produce more quantitative insights, and data visualization techniques can then communicate the findings to a wider audience.</p>


-----------------

### Part 1 - Text Mining
Text mining is how we get valuable insights out of text, but the first step is to obtain the text data. We use <b>Web Scraping</b> to extract data from websites: a scraper downloads a page's underlying HTML and pulls out the data embedded in it, which the site typically serves from its own database.

In [1]:
import requests # to get the encoding, status, content of any website
from bs4 import BeautifulSoup # main library for scraping.
import pandas as pd # to deal with data

In [2]:
# url of ecommerce website product reviews
url = 'https://www.amazon.in/OnePlus-Nord-Gray-256GB-Storage/product-reviews/B08697WT6D/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews'

In [3]:
html_data = requests.get(url).text
# html_data now holds the page's HTML source as a string
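
A bare request like the one above may return an error page or a bot-check page instead of the reviews, so it can help to check the response before parsing it. The following is a minimal sketch and not part of the original notebook: the helper name `fetch_html`, the User-Agent string, and the 10-second timeout are all illustrative assumptions.

```python
import requests

# Hypothetical header value; some sites respond differently to clients
# that do not send a browser-like User-Agent.
DEFAULT_HEADERS = {"User-Agent": "Mozilla/5.0 (review-scraper-demo)"}


def fetch_html(url, session=None, timeout=10):
    """Return the response body as text, raising on HTTP error codes."""
    session = session or requests.Session()
    response = session.get(url, headers=DEFAULT_HEADERS, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx
    return response.text
```

With this helper, `html_data = fetch_html(url)` would stand in for the `requests.get(url).text` call, and a failed request would surface as an exception rather than as unexpected HTML.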

In [5]:
soup = BeautifulSoup(html_data, 'lxml')  # parse the HTML text into a tree that bs4 can search
# 'lxml' is the parser requested here; it requires the lxml package to be installed

In [6]:
soup

<!DOCTYPE html>
<html class="a-no-js" data-19ax5a9jf="dingo" lang="en-in"><!-- sp:feature:head-start -->
<head><script>var aPageStart = (new Date()).getTime();</script><meta charset="utf-8"/>
<!-- sp:feature:cs-optimization -->
<meta content="on" http-equiv="x-dns-prefetch-control"/>
<link href="https://images-eu.ssl-images-amazon.com" rel="dns-prefetch"/>
<link href="https://m.media-amazon.com" rel="dns-prefetch"/>
<link href="https://completion.amazon.com" rel="dns-prefetch"/>
<!-- sp:feature:aui-assets -->
<link href="https://images-eu.ssl-images-amazon.com/images/I/11EIQ5IGqaL._RC|012LjolmrML.css,418YjvsUB+L.css,21qPwhPKAAL.css,01Vctty9pOL.css,017DsKjNQJL.css,0131vqwP5UL.css,41EWOOlBJ9L.css,11TIuySqr6L.css,01ElnPiDxWL.css,11bGSgD5pDL.css,01Dm5eKVxwL.css,01IdKcBuAdL.css,01y-XAlI+2L.css,01ZfXnjPmmL.css,01oDR3IULNL.css,31MKqadzl-L.css,01XPHJk60-L.css,01R0k0yxPXL.css,21xVR0NtxzL.css,11gneA3MtJL.css,21fecG8pUzL.css,01RddH8vm-L.css,01CFUgsA-YL.css,21AmhU6t0sL.css,11zGrJZ9D2L.css,11tRp6+0

In [7]:
# to find the reviewer names in this HTML, use find_all(name, attrs)
names = soup.find_all('span',{'class':'a-profile-name'})
names

[<span class="a-profile-name">Kiran KS</span>,
 <span class="a-profile-name">Abhishek Agarwal</span>,
 <span class="a-profile-name">Abhishek Agarwal</span>,
 <span class="a-profile-name">Kiran KS</span>,
 <span class="a-profile-name">Aman More</span>,
 <span class="a-profile-name">Nikhil</span>,
 <span class="a-profile-name">Sreeram</span>,
 <span class="a-profile-name">Deblina Roy</span>,
 <span class="a-profile-name">Nitin Mittal</span>,
 <span class="a-profile-name">Amit</span>,
 <span class="a-profile-name">Sudeshna Dey</span>,
 <span class="a-profile-name">Krithika</span>]

In [8]:
# Putting the data into proper structure
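profile_names = [tag.text for tag in names]
profile_names = pd.Series(profile_names, name="User_Name")
# drop the first two entries: they repeat names that appear again further down
# (presumably highlighted reviews shown at the top of the page)
profile_names = profile_names[2:].reset_index(drop=True)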
profile_names

0    Abhishek Agarwal
1            Kiran KS
2           Aman More
3              Nikhil
4             Sreeram
5         Deblina Roy
6        Nitin Mittal
7                Amit
8        Sudeshna Dey
9            Krithika
Name: User_Name, dtype: object

In [9]:
# get the title
title = soup.find_all('a',{'data-hook':'review-title'})

In [10]:
Review_title = []
for i in title:
    tag = i.find('span')
    text = tag.text
    Review_title.append(text)

In [11]:
Review_title = pd.Series(Review_title,name = "Title")
Review_title

0                                       Bad bad camera
1                     The original segment of One Plus
2                                 A good daily driver.
3                              *Read before you buy!!*
4                 Totally dissatisfied. No AR support.
5                                        Disappointing
6    Empty mobile box with all.missomg contents was...
7                        Near to mid range  Perfection
8                                     Bad front camera
9                                         Great price!
Name: Title, dtype: object

In [12]:
# get the review
review = soup.find_all("span",{"data-hook":"review-body"})

In [13]:
review_text = []
for i in review:
    text = i.text
    review_text.append(text)

In [14]:
review_text = pd.Series(review_text,name="Review data")
review_text


0    \n\n  It's not very often I leave a critical r...
1    \n\n  Battery usage update: Drains faster than...
2    \n\n  Pros:1) Clean and bloatfree OxygenOS, wh...
3    \n\n  Yea..pre-ordered on 28 July, got it on 4...
4    \n\n  I bought this phone for augmented realit...
5    \n\n  Heavily disappointed. So much of hype an...
6    \n\n\n  Your browser does not support HTML5 vi...
7    \n\n  Got it delivered yesterday , used for ab...
8    \n\n  Front camera is very bad , and low light...
9    \n\n  An amazing phone!Got it delivered today ...
Name: Review data, dtype: object
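
The review bodies above still start with newline characters and padding spaces. A small sketch of one way to tidy them, using only built-in string methods; the helper name `clean_review_text` and the sample string are illustrative and not part of the original notebook.

```python
def clean_review_text(text):
    """Trim surrounding whitespace and collapse internal runs of whitespace."""
    # str.split() with no arguments splits on any whitespace and drops empties
    return " ".join(text.split())


cleaned = clean_review_text("\n\n  An example   review body.\n")
```

On the Series itself this could be applied as `review_text.apply(clean_review_text)`.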

In [15]:
# get the review date
review_date = soup.find_all("span",{"data-hook":"review-date"})

In [16]:
reviews_date = []
for i in review_date:
    reviews_date.append(i.text)
Date = pd.Series(reviews_date,name="Date")
Date

0     Reviewed in India on 2 August 2020
1     Reviewed in India on 3 August 2020
2     Reviewed in India on 3 August 2020
3     Reviewed in India on 4 August 2020
4     Reviewed in India on 3 August 2020
5     Reviewed in India on 4 August 2020
6    Reviewed in India on 30 August 2020
7     Reviewed in India on 3 August 2020
8     Reviewed in India on 3 August 2020
9     Reviewed in India on 4 August 2020
Name: Date, dtype: object

In [17]:
Date = Date.apply(lambda x:x.replace("Reviewed in India on ",""))

In [18]:
Date

0     2 August 2020
1     3 August 2020
2     3 August 2020
3     4 August 2020
4     3 August 2020
5     4 August 2020
6    30 August 2020
7     3 August 2020
8     3 August 2020
9     4 August 2020
Name: Date, dtype: object
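
The dates are still plain strings at this point. Below is a minimal sketch of turning them into real date objects, assuming the "day month year" wording seen above and English month names. The helper name `parse_review_date` is illustrative. A regular expression picks out the trailing date so that the country name does not have to be hard-coded.

```python
import re
from datetime import datetime

# Matches a trailing date such as "2 August 2020" or "30 August 2020"
DATE_PATTERN = re.compile(r"(\d{1,2} \w+ \d{4})\s*$")


def parse_review_date(raw):
    """Return a datetime.date parsed from the end of raw, or None if absent."""
    match = DATE_PATTERN.search(raw)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%d %B %Y").date()


first = parse_review_date("Reviewed in India on 2 August 2020")
```

With pandas, `pd.to_datetime(Date, format="%d %B %Y")` is an alternative once the prefix has been removed as in the cell above.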

In [19]:
# rating of product
ratings = soup.find_all("i",{"data-hook":"review-star-rating"})

In [20]:
Star_rating = []
for i in ratings:
    Star_rating.append(i.text)

Ratings = pd.Series(Star_rating,name="Ratings")

In [21]:
Ratings = Ratings.apply(lambda x: x.replace('out of 5 stars',''))
Ratings

0    3.0 
1    4.0 
2    4.0 
3    5.0 
4    1.0 
5    2.0 
6    1.0 
7    5.0 
8    1.0 
9    5.0 
Name: Ratings, dtype: object
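
Note that the ratings above are still strings (dtype `object`) with a trailing space, so they cannot be averaged or compared numerically yet. A minimal sketch of converting them, with the illustrative helper name `parse_rating`:

```python
def parse_rating(text):
    """Return the leading number of a rating string as a float."""
    # Works for both "3.0 out of 5 stars" and the already trimmed "3.0 "
    return float(text.split()[0])


raw_rating = parse_rating("3.0 out of 5 stars")
trimmed_rating = parse_rating("3.0 ")
```

Applied to the Series, `Ratings.apply(parse_rating)` would give a numeric column.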

In [22]:
# Concatenating into a final data frame
Final_data = pd.concat([profile_names,Review_title,review_text,Date,Ratings],axis = 1)
Final_data

|  | User_Name | Title | Review data | Date | Ratings |
|---|---|---|---|---|---|
| 0 | Abhishek Agarwal | Bad bad camera | \n\n It's not very often I leave a critical r... | 2 August 2020 | 3.0 |
| 1 | Kiran KS | The original segment of One Plus | \n\n Battery usage update: Drains faster than... | 3 August 2020 | 4.0 |
| 2 | Aman More | A good daily driver. | \n\n Pros:1) Clean and bloatfree OxygenOS, wh... | 3 August 2020 | 4.0 |
| 3 | Nikhil | \*Read before you buy!!\* | \n\n Yea..pre-ordered on 28 July, got it on 4... | 4 August 2020 | 5.0 |
| 4 | Sreeram | Totally dissatisfied. No AR support. | \n\n I bought this phone for augmented realit... | 3 August 2020 | 1.0 |
| 5 | Deblina Roy | Disappointing | \n\n Heavily disappointed. So much of hype an... | 4 August 2020 | 2.0 |
| 6 | Nitin Mittal | Empty mobile box with all.missomg contents was... | \n\n\n Your browser does not support HTML5 vi... | 30 August 2020 | 1.0 |
| 7 | Amit | Near to mid range Perfection | \n\n Got it delivered yesterday , used for ab... | 3 August 2020 | 5.0 |
| 8 | Sudeshna Dey | Bad front camera | \n\n Front camera is very bad , and low light... | 3 August 2020 | 1.0 |
| 9 | Krithika | Great price! | \n\n An amazing phone!Got it delivered today ... | 4 August 2020 | 5.0 |
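
The cells above collect each field with a separate `find_all` call and then line the lists up by position, which is why the first two names had to be sliced off. A more defensive pattern is to loop over one container per review and read every field from inside it, so a missing field becomes `None` instead of shifting a column. The sketch below is not from the original notebook: it assumes each review sits in a `div` with `data-hook="review"`, and it runs on made-up sample markup rather than the live page. The helper names `text_or_none` and `extract_reviews` are illustrative.

```python
from bs4 import BeautifulSoup

# Made-up markup that mimics the data-hook attributes used above
SAMPLE_HTML = """
<div data-hook="review">
  <span class="a-profile-name">Reviewer A</span>
  <a data-hook="review-title"><span>Sample title</span></a>
  <i data-hook="review-star-rating"><span>4.0 out of 5 stars</span></i>
  <span data-hook="review-date">Reviewed in India on 3 August 2020</span>
  <span data-hook="review-body"><span>Sample review text.</span></span>
</div>
<div data-hook="review">
  <span class="a-profile-name">Reviewer B</span>
  <a data-hook="review-title"><span>Another title</span></a>
  <span data-hook="review-date">Reviewed in India on 4 August 2020</span>
  <span data-hook="review-body"><span>No rating in this one.</span></span>
</div>
"""


def text_or_none(parent, name, attrs):
    """Return the stripped text of the first matching tag, or None."""
    tag = parent.find(name, attrs)
    return tag.get_text(strip=True) if tag is not None else None


def extract_reviews(html):
    """Return one dict per review container found in html."""
    # html.parser ships with Python, so this sketch does not need lxml
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for block in soup.find_all("div", {"data-hook": "review"}):
        rows.append({
            "User_Name": text_or_none(block, "span", {"class": "a-profile-name"}),
            "Title": text_or_none(block, "a", {"data-hook": "review-title"}),
            "Review data": text_or_none(block, "span", {"data-hook": "review-body"}),
            "Date": text_or_none(block, "span", {"data-hook": "review-date"}),
            "Ratings": text_or_none(block, "i", {"data-hook": "review-star-rating"}),
        })
    return rows


rows = extract_reviews(SAMPLE_HTML)
```

Passing the result to `pd.DataFrame(rows)` would give a frame with the same column names as `Final_data`.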


In [23]:
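# to_excel() needs a target path; 'amazon_reviews.xlsx' is a placeholder file name
# writing .xlsx files also requires an Excel engine such as openpyxl to be installed
Final_data.to_excel('amazon_reviews.xlsx', index=False)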

### Web Scraping - Points to Remember
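- Get the page's HTML source code.
- Parse it with the Beautiful Soup library.
- Look up the tags, by name and attributes, that hold the text you need, and extract their text.
- Put the results into a proper data structure, such as a pandas Series or DataFrame.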
------