### Web Scraping

In [1]:
import requests

In [2]:
from bs4 import BeautifulSoup

In [3]:
r = requests.get('https://www.yelp.com/biz/tesla-san-francisco?osq=Tesla+Dealership')

In [4]:
# Check request status
print(r.status_code)  # Any value other than 200 means the request did not succeed; check that the URL is valid and correctly formed.

200


In [5]:
r.text

'<!DOCTYPE html><html lang="en-US" prefix="og: http://ogp.me/ns#" style="margin: 0;padding: 0; border: 0; font-size: 100%; font: inherit; vertical-align: baseline;"><head><script>document.documentElement.className=document.documentElement.className.replace(/\x08no-js\x08/,"js");</script><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><meta http-equiv="Content-Language" content="en-US" /><meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no"><link rel="mask-icon" sizes="any" href="https://s3-media0.fl.yelpcdn.com/assets/srv0/yelp_large_assets/b2bb2fb0ec9c/assets/img/logos/yelp_burst.svg" content="#FF1A1A"><link rel="shortcut icon" href="https://s3-media0.fl.yelpcdn.com/assets/srv0/yelp_large_assets/dcfe403147fc/assets/img/logos/favicon.ico"><script> window.ga=window.ga||function(){(ga.q=ga.q||[]).push(arguments)};ga.l=+new Date;window.ygaPageStartTime=new Date().getTime();</script><script>\n            window.yelp = window.yelp || {};\

In [6]:
# Make the soup
soup = BeautifulSoup(r.text, 'html.parser')

#### 'html.parser' is the HTML parser from Python's standard library, which BeautifulSoup can use to parse HTML and convert it into a navigable Python data structure. When you retrieve HTML content from a web page, it arrives as raw text. A parser like 'html.parser' processes this raw text and builds a hierarchical tree that you can navigate and manipulate using Python code.
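
As a small illustration of that hierarchical structure, the sketch below parses a hand-written HTML string and walks the resulting tree. The markup and the class name `comment` are made up for the example and are not taken from the Yelp page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, written by hand for illustration only.
html = '<div><p class="comment"><span lang="en">Great service</span></p></div>'

soup = BeautifulSoup(html, "html.parser")

# Nested tags can be reached by name, one level at a time.
paragraph = soup.div.p

# "class" is a multi-valued attribute, so it comes back as a list.
print(paragraph["class"])

# Text inside the nested <span>.
print(paragraph.span.get_text())
```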

In [7]:
# First get all of the review paragraphs
results = soup.find_all(class_='comment__09f24__D0cxf css-qgunke')
# This returns every element whose class attribute is exactly 'comment__09f24__D0cxf css-qgunke' (here, the <p> tags that wrap each review).
print(results)

[<p class="comment__09f24__D0cxf css-qgunke"><span class="raw__09f24__T4Ezm" lang="en">Wow! The best tesla service center I have ever been to. In previous experiences in Berkeley and LA, it takes over 2 months to get an appointment but I was able to schedule one here for the next week. They had coffee and snacks and my repair was done in 30 min. I didn't catch his name but the person working on my car was a younger asian man with glasses. I am definitely going to look for him next time I have to come in. I had such a hatred for tesla service centers before but now I am glad to know I can rely on this location!!</span></p>, <p class="comment__09f24__D0cxf css-qgunke"><span class="raw__09f24__T4Ezm" lang="en">y'all cars ain't shit if it's able to hit another car  with children inside<br/><br/>a person crashed 20 teslas and y'all still haven't gotten it right ! <br/><br/>the design of the car needs some more alterations</span></p>, <p class="comment__09f24__D0cxf css-qgunke"><span class="

In [8]:
# # Alternative: select the inner review <span> elements directly
# results = soup.find_all(class_='raw__09f24__T4Ezm')
# # This returns every element that has the class 'raw__09f24__T4Ezm'.
# print(results)

#### Then we can loop through each paragraph found, use the find function to get the span that holds the review text, and store that text in a list.

In [9]:
reviews = []
for result in results:
    reviews.append((result.find('span', class_='raw__09f24__T4Ezm')).text)
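
One detail worth knowing: `.text` joins the text nodes with nothing between them, so text on either side of a `<br/>` runs together. The sketch below, on hand-written markup with made-up class names, uses `get_text` with a separator instead and also skips paragraphs that have no matching span, since `find` returns `None` in that case.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration; not the real Yelp class names.
html = (
    '<p class="comment"><span class="raw">first line<br/><br/>second line</span></p>'
    '<p class="comment">no span here</p>'
)
soup = BeautifulSoup(html, "html.parser")

reviews = []
for paragraph in soup.find_all("p", class_="comment"):
    span = paragraph.find("span", class_="raw")
    if span is None:  # find() returns None when nothing matches
        continue
    # Join the text nodes with a space so text around <br/> stays separated.
    reviews.append(span.get_text(separator=" ", strip=True))

print(reviews)
```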
    

In [10]:
for review in reviews:
    print(review,"\n")

Wow! The best tesla service center I have ever been to. In previous experiences in Berkeley and LA, it takes over 2 months to get an appointment but I was able to schedule one here for the next week. They had coffee and snacks and my repair was done in 30 min. I didn't catch his name but the person working on my car was a younger asian man with glasses. I am definitely going to look for him next time I have to come in. I had such a hatred for tesla service centers before but now I am glad to know I can rely on this location!! 

y'all cars ain't shit if it's able to hit another car  with children insidea person crashed 20 teslas and y'all still haven't gotten it right ! the design of the car needs some more alterations 

Really poor service. I took my car in to get the front passenger door fixed on Monday and was told the part was delivered damaged so I needed to reschedule the appointment for Friday and they would do it via mobile service. So the service center scheduled a 8am-12pm Fri

In [11]:
reviews[0]

"Wow! The best tesla service center I have ever been to. In previous experiences in Berkeley and LA, it takes over 2 months to get an appointment but I was able to schedule one here for the next week. They had coffee and snacks and my repair was done in 30 min. I didn't catch his name but the person working on my car was a younger asian man with glasses. I am definitely going to look for him next time I have to come in. I had such a hatred for tesla service centers before but now I am glad to know I can rely on this location!!"

In [12]:
import pandas as pd
import numpy as np

In [13]:
# Create a pandas dataframe from array
df = pd.DataFrame(np.array(reviews), columns=['review'])

In [14]:
df.head(5)

Unnamed: 0,review
0,Wow! The best tesla service center I have ever...
1,y'all cars ain't shit if it's able to hit anot...
2,Really poor service. I took my car in to get t...
3,Helena KElon Musk!Is climbing the highest moun...
4,In a nutshell: Tesla sucks! I leased one of th...


In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   review  10 non-null     object
dtypes: object(1)
memory usage: 208.0+ bytes


In [16]:
len(df["review"])

10

In [17]:
# Calculate word count
df['word_count'] = df['review'].apply(lambda x: len(str(x).split(" ")))
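
Note that `split(" ")` and `split()` can give different counts. A minimal sketch on a made-up sentence:

```python
text = "Great  service and fast repair"  # note the double space

by_space = text.split(" ")    # splits on every single space
by_whitespace = text.split()  # treats any run of whitespace as one separator

print(len(by_space))       # the double space produces an empty string, which is counted
print(len(by_whitespace))  # only the actual words are counted
```

The average word length function further down uses `split()`, so the two measures may not agree on the number of words in a review.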

In [18]:
df["word_count"]

0    109
1     35
2    111
3    125
4    136
5     59
6     51
7     49
8     73
9     83
Name: word_count, dtype: int64

In [19]:
df.head()

Unnamed: 0,review,word_count
0,Wow! The best tesla service center I have ever...,109
1,y'all cars ain't shit if it's able to hit anot...,35
2,Really poor service. I took my car in to get t...,111
3,Helena KElon Musk!Is climbing the highest moun...,125
4,In a nutshell: Tesla sucks! I leased one of th...,136


In [20]:
# Calculate character count
df['char_count'] = df['review'].apply(lambda x: len(x))

In [21]:
df["char_count"]

0    531
1    193
2    632
3    745
4    734
5    278
6    296
7    270
8    350
9    459
Name: char_count, dtype: int64

In [22]:
df.head()

Unnamed: 0,review,word_count,char_count
0,Wow! The best tesla service center I have ever...,109,531
1,y'all cars ain't shit if it's able to hit anot...,35,193
2,Really poor service. I took my car in to get t...,111,632
3,Helena KElon Musk!Is climbing the highest moun...,125,745
4,In a nutshell: Tesla sucks! I leased one of th...,136,734


In [23]:
#Average word length – the average length of words used
def avg_word_len(review):
    words = review.split()
    return (sum(len(word) for word in words) / len(words))

# Calculate average word length
df['avg_word_len'] = df['review'].apply(avg_word_len)
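
The function above divides by the number of words, so it would raise `ZeroDivisionError` on an empty review. None of the ten reviews here is empty, but a guarded variant could look like the sketch below. The name `avg_word_len_safe` and the choice to return 0.0 are assumptions made for illustration.

```python
def avg_word_len_safe(review):
    """Mean word length, or 0.0 when the review contains no words."""
    words = review.split()
    if not words:
        return 0.0
    return sum(len(word) for word in words) / len(words)

print(avg_word_len_safe("fast repair"))  # (4 + 6) / 2
print(avg_word_len_safe("   "))          # whitespace only, so no words
```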

In [24]:
df["avg_word_len"]

0    3.880734
1    4.514286
2    4.702703
3    4.960000
4    4.404412
5    3.728814
6    4.823529
7    4.530612
8    3.808219
9    4.542169
Name: avg_word_len, dtype: float64

In [25]:
df.head()

Unnamed: 0,review,word_count,char_count,avg_word_len
0,Wow! The best tesla service center I have ever...,109,531,3.880734
1,y'all cars ain't shit if it's able to hit anot...,35,193,4.514286
2,Really poor service. I took my car in to get t...,111,632,4.702703
3,Helena KElon Musk!Is climbing the highest moun...,125,745,4.96
4,In a nutshell: Tesla sucks! I leased one of th...,136,734,4.404412


In [26]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Gulshan\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [27]:
#Stopword Count – total number of words which are considered stop words
# Import stopwords
from nltk.corpus import stopwords

In [28]:
# Load the English stop word list
stop_words = stopwords.words('english')

In [29]:
df.head()

Unnamed: 0,review,word_count,char_count,avg_word_len
0,Wow! The best tesla service center I have ever...,109,531,3.880734
1,y'all cars ain't shit if it's able to hit anot...,35,193,4.514286
2,Really poor service. I took my car in to get t...,111,632,4.702703
3,Helena KElon Musk!Is climbing the highest moun...,125,745,4.96
4,In a nutshell: Tesla sucks! I leased one of th...,136,734,4.404412


In [30]:
def stopword(x):
    countstop=[]
    words=x.split()
    for word in words:
        if word.lower() in stop_words:
            countstop.append(word)
    return len(countstop)

In [31]:
df['stopword_count']=df["review"].apply(stopword)
# Or, equivalently, as a one-liner:
#df['stopword_count'] = df['review'].apply(lambda x: len([w for w in x.split() if w.lower() in stop_words]))

In [32]:
df.head()

Unnamed: 0,review,word_count,char_count,avg_word_len,stopword_count
0,Wow! The best tesla service center I have ever...,109,531,3.880734,58
1,y'all cars ain't shit if it's able to hit anot...,35,193,4.514286,12
2,Really poor service. I took my car in to get t...,111,632,4.702703,49
3,Helena KElon Musk!Is climbing the highest moun...,125,745,4.96,48
4,In a nutshell: Tesla sucks! I leased one of th...,136,734,4.404412,65


In [33]:
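# Note: this is the number of words per stop word, so a lower value means a larger share of stop words
df["stopword_rate"]=df["word_count"]/df["stopword_count"]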
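
To make the direction of this ratio concrete, here is the arithmetic for the first review, using the `word_count` and `stopword_count` values shown in the table above. The variable names are chosen for this sketch only.

```python
word_count = 109     # first review, word_count column
stopword_count = 58  # first review, stopword_count column

# What the stopword_rate column stores: words per stop word.
words_per_stopword = word_count / stopword_count

# The inverse: the fraction of words that are stop words.
stopword_share = stopword_count / word_count

print(round(words_per_stopword, 5))
print(round(stopword_share, 3))
```

If a proportion between 0 and 1 is what you need, the second form is the one to use. It also avoids dividing by zero for a review that contains no stop words.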

In [34]:
df.head()

Unnamed: 0,review,word_count,char_count,avg_word_len,stopword_count,stopword_rate
0,Wow! The best tesla service center I have ever...,109,531,3.880734,58,1.87931
1,y'all cars ain't shit if it's able to hit anot...,35,193,4.514286,12,2.916667
2,Really poor service. I took my car in to get t...,111,632,4.702703,49,2.265306
3,Helena KElon Musk!Is climbing the highest moun...,125,745,4.96,48,2.604167
4,In a nutshell: Tesla sucks! I leased one of th...,136,734,4.404412,65,2.092308


In [35]:
df.sort_values(by="stopword_rate")

Unnamed: 0,review,word_count,char_count,avg_word_len,stopword_count,stopword_rate
0,Wow! The best tesla service center I have ever...,109,531,3.880734,58,1.87931
5,I waited for 25 mins and no one even acknowled...,59,278,3.728814,31,1.903226
8,"Well, I had an issue with my Tesla. Took it in...",73,350,3.808219,38,1.921053
9,Nick has been amazing in educating us about th...,83,459,4.542169,41,2.02439
4,In a nutshell: Tesla sucks! I leased one of th...,136,734,4.404412,65,2.092308
6,Delivery and customer service experience is be...,51,296,4.823529,24,2.125
2,Really poor service. I took my car in to get t...,111,632,4.702703,49,2.265306
3,Helena KElon Musk!Is climbing the highest moun...,125,745,4.96,48,2.604167
1,y'all cars ain't shit if it's able to hit anot...,35,193,4.514286,12,2.916667
7,I took back my 2018 Model 3 last month for saf...,49,270,4.530612,15,3.266667


### Data Cleaning

In [36]:
# Lower case all words
df['review_lower'] = df['review'].apply(lambda x: " ".join(x.lower() for x in x.split()))

In [37]:
df["review_lower"]

0    wow! the best tesla service center i have ever...
1    y'all cars ain't shit if it's able to hit anot...
2    really poor service. i took my car in to get t...
3    helena kelon musk!is climbing the highest moun...
4    in a nutshell: tesla sucks! i leased one of th...
5    i waited for 25 mins and no one even acknowled...
6    delivery and customer service experience is be...
7    i took back my 2018 model 3 last month for saf...
8    well, i had an issue with my tesla. took it in...
9    nick has been amazing in educating us about th...
Name: review_lower, dtype: object

In [38]:
df.head()

Unnamed: 0,review,word_count,char_count,avg_word_len,stopword_count,stopword_rate,review_lower
0,Wow! The best tesla service center I have ever...,109,531,3.880734,58,1.87931,wow! the best tesla service center i have ever...
1,y'all cars ain't shit if it's able to hit anot...,35,193,4.514286,12,2.916667,y'all cars ain't shit if it's able to hit anot...
2,Really poor service. I took my car in to get t...,111,632,4.702703,49,2.265306,really poor service. i took my car in to get t...
3,Helena KElon Musk!Is climbing the highest moun...,125,745,4.96,48,2.604167,helena kelon musk!is climbing the highest moun...
4,In a nutshell: Tesla sucks! I leased one of th...,136,734,4.404412,65,2.092308,in a nutshell: tesla sucks! i leased one of th...


In [39]:
# Remove Punctuation
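# regex=True is required: recent pandas versions treat the pattern as a literal string by default
df['review_nopunc'] = df['review_lower'].str.replace(r'[^\w\s]', '', regex=True)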
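
The pattern `[^\w\s]` matches any character that is neither a word character nor whitespace. The sketch below applies the same pattern with the standard `re` module to short strings. The helper name `strip_punctuation` is made up for this example.

```python
import re

def strip_punctuation(text, replacement=""):
    """Replace every character that is neither a word character nor whitespace."""
    return re.sub(r"[^\w\s]", replacement, text)

no_apostrophes = strip_punctuation("y'all cars ain't")
glued = strip_punctuation("musk!is climbing")
spaced = strip_punctuation("musk!is climbing", " ")

print(no_apostrophes)
print(glued)   # the two words around "!" are merged
print(spaced)  # replacing with a space keeps them apart
```

Replacing punctuation with a space rather than an empty string is one way to avoid merged tokens. It also splits contractions such as "ain't" into two tokens, so which behaviour is preferable depends on the analysis.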



In [40]:
# Import stopwords
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

# Remove Stopwords
df['review_nopunc_nostop'] = df['review_nopunc'].apply(lambda x: " ".join(word for word in x.split() if word not in stop_words))

In [41]:
df["review_nopunc_nostop"]

0    wow best tesla service center ever previous ex...
1    yall cars aint shit able hit another car child...
2    really poor service took car get front passeng...
3    helena kelon muskis climbing highest mount wor...
4    nutshell tesla sucks leased one model ys 2021 ...
5    waited 25 mins one even acknowledged show room...
6    delivery customer service experience beyond ho...
7    took back 2018 model 3 last month safety recal...
8    well issue tesla took service center thursday ...
9    nick amazing educating us teslas answering man...
Name: review_nopunc_nostop, dtype: object

In [42]:
df.head()

Unnamed: 0,review,word_count,char_count,avg_word_len,stopword_count,stopword_rate,review_lower,review_nopunc,review_nopunc_nostop
0,Wow! The best tesla service center I have ever...,109,531,3.880734,58,1.87931,wow! the best tesla service center i have ever...,wow the best tesla service center i have ever ...,wow best tesla service center ever previous ex...
1,y'all cars ain't shit if it's able to hit anot...,35,193,4.514286,12,2.916667,y'all cars ain't shit if it's able to hit anot...,yall cars aint shit if its able to hit another...,yall cars aint shit able hit another car child...
2,Really poor service. I took my car in to get t...,111,632,4.702703,49,2.265306,really poor service. i took my car in to get t...,really poor service i took my car in to get th...,really poor service took car get front passeng...
3,Helena KElon Musk!Is climbing the highest moun...,125,745,4.96,48,2.604167,helena kelon musk!is climbing the highest moun...,helena kelon muskis climbing the highest mount...,helena kelon muskis climbing highest mount wor...
4,In a nutshell: Tesla sucks! I leased one of th...,136,734,4.404412,65,2.092308,in a nutshell: tesla sucks! i leased one of th...,in a nutshell tesla sucks i leased one of thei...,nutshell tesla sucks leased one model ys 2021 ...


In [43]:
# Return frequency of values
freq= pd.Series(" ".join(df['review_nopunc_nostop']).split()).value_counts()[:30]
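
The same idea (join all reviews, split into words, count each word) can be sketched with the standard library's `collections.Counter`. The two short reviews below are made up for the example.

```python
from collections import Counter

# Hypothetical cleaned reviews, for illustration only.
cleaned_reviews = ["service center great service", "poor service"]

# Join everything into one string, split into words, and count them.
counts = Counter(" ".join(cleaned_reviews).split())

top_word = counts.most_common(1)
print(top_word)
```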

In [44]:
other_stopwords = ['get', 'us', 'see', 'use', 'said', 'asked', 'day', 'go',
  'even', 'ive', 'right', 'left', 'always', 'would', 'told',
  'get', 'us', 'would', 'get', 'one', 'ive', 'go', 'even',
  'also', 'ever', 'x', 'take', 'let']

In [45]:
len(other_stopwords)

28

In [46]:
df["clean_review"]=df["review_nopunc_nostop"].apply(lambda x: " ".join(word for word in x.split() if word not in other_stopwords))
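
Two things to watch for when writing a custom stop word list by hand: Python joins adjacent string literals, so a missing comma silently merges two entries, and repeated entries inflate the length of a list. The sketch below, on made-up data, shows the merge and uses a set to drop repeats.

```python
# A missing comma after "go" merges it with the next literal.
merged = ["day", "go" "even", "right"]
print(merged)

# A set ignores repeated entries and gives fast membership checks.
extra_stopwords = {"day", "go", "even", "right", "go"}
print(len(extra_stopwords))

tokens = "even so we go there every day".split()
filtered = [token for token in tokens if token not in extra_stopwords]
print(filtered)
```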

### LEMMATIZE THE REVIEWS

#### Lemmatization is the process of reducing words to their base (dictionary) form. For example:

#### 1. am, are, is would be lemmatized to be
#### 2. car, cars, car’s, cars’ would be lemmatized to car
#### This reduces the number of distinct words available for analysis by combining similar forms into one base form. Another process commonly used to cut down the number of unique words in natural language processing is stemming.

#### Stemming reduces the number of unique words by stripping common endings.

#### Example:
#### 1. Caresses is stemmed to caress
#### 2. Ponies is stemmed to poni
#### Some stems are still valid words, but as the word ponies above shows, this is not always the case. Here we’ll use lemmatization to shorten our word list.
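
The stemming examples can be reproduced with NLTK's `PorterStemmer`. This is one common stemmer, used here as an assumption; the notebook itself does not use a stemmer.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Stem a few words; the result is not always a dictionary word.
stems = {word: stemmer.stem(word) for word in ["caresses", "ponies", "cars"]}

for word, stem in stems.items():
    print(word, "->", stem)
```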

In [47]:
#The text blob module provides a simple method to lemmatize the reviews.
!pip install textblob




In [48]:
# Import textblob
from textblob import Word

In [49]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Gulshan\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [50]:
# Lemmatize final review format
df['lemmatize_review'] = df['clean_review'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))

In [51]:
# Once you've imported the Word class, you can create an instance of it by passing a word as an argument. For example:

# word = Word("running")
# After creating the instance, you can perform various operations on the word. One common operation is lemmatization, which reduces the word to its base or dictionary form.
# lemmatize() treats the word as a noun unless a part of speech is given, so pass "v" to lemmatize a verb. For example:

# lemmatized_word = word.lemmatize("v")  # Returns "run"

### Sentiment Analysis

In [52]:
# The textblob module is also handy for this task: it returns a polarity score
# as well as a subjectivity score.

# Polarity measures how positive or negative the analysed text is,
# on a scale from -1 to 1. A score of 1 is highly positive, whereas -1 is highly negative.

In [53]:
# Calculate polarity
from textblob import TextBlob

In [54]:
df['lemmatize_review'].apply(lambda x: TextBlob(x).sentiment)

0    (0.14444444444444446, 0.31597222222222227)
1                                (0.15, 0.7125)
2     (-0.16666666666666666, 0.554320987654321)
3     (0.28500000000000003, 0.6066666666666667)
4                  (-0.145, 0.5158333333333334)
5                  (-0.08181818181818182, 0.45)
6    (-0.07500000000000001, 0.7284722222222222)
7                    (0.2, 0.31333333333333335)
8                     (0.0, 0.4133333333333334)
9       (0.5020661157024793, 0.640220385674931)
Name: lemmatize_review, dtype: object

In [55]:
df['polarity'] = df['lemmatize_review'].apply(lambda x: TextBlob(x).sentiment[0])
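
If a categorical label is easier to work with than a raw score, polarity can be bucketed. The sketch below is one possible way to do that; the function name `label_polarity` and the 0.05 threshold are assumptions for illustration, not something TextBlob defines. The sample scores are the polarity values of reviews 9, 2 and 8 from the output above.

```python
def label_polarity(score, threshold=0.05):
    """Map a polarity score in [-1, 1] to a coarse label (threshold is an assumption)."""
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"

scores = [0.5020661157024793, -0.16666666666666666, 0.0]
labels = [label_polarity(score) for score in scores]
print(labels)
```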

In [56]:
# We can also analyse subjectivity:
# the degree to which the analysed text expresses personal opinion
# rather than factual information, on a scale from 0 to 1.
# Scores closer to 1 indicate more subjective, opinion-based text.

In [57]:
# Calculate subjectivity
df['subjectivity'] = df['lemmatize_review'].apply(lambda x: TextBlob(x).sentiment[1])

In [58]:
df.head()

Unnamed: 0,review,word_count,char_count,avg_word_len,stopword_count,stopword_rate,review_lower,review_nopunc,review_nopunc_nostop,clean_review,lemmatize_review,polarity,subjectivity
0,Wow! The best tesla service center I have ever...,109,531,3.880734,58,1.87931,wow! the best tesla service center i have ever...,wow the best tesla service center i have ever ...,wow best tesla service center ever previous ex...,wow best tesla service center previous experie...,wow best tesla service center previous experie...,0.144444,0.315972
1,y'all cars ain't shit if it's able to hit anot...,35,193,4.514286,12,2.916667,y'all cars ain't shit if it's able to hit anot...,yall cars aint shit if its able to hit another...,yall cars aint shit able hit another car child...,yall cars aint shit able hit another car child...,yall car aint shit able hit another car child ...,0.15,0.7125
2,Really poor service. I took my car in to get t...,111,632,4.702703,49,2.265306,really poor service. i took my car in to get t...,really poor service i took my car in to get th...,really poor service took car get front passeng...,really poor service took car front passenger d...,really poor service took car front passenger d...,-0.166667,0.554321
3,Helena KElon Musk!Is climbing the highest moun...,125,745,4.96,48,2.604167,helena kelon musk!is climbing the highest moun...,helena kelon muskis climbing the highest mount...,helena kelon muskis climbing highest mount wor...,helena kelon muskis climbing highest mount wor...,helena kelon muskis climbing highest mount wor...,0.285,0.606667
4,In a nutshell: Tesla sucks! I leased one of th...,136,734,4.404412,65,2.092308,in a nutshell: tesla sucks! i leased one of th...,in a nutshell tesla sucks i leased one of thei...,nutshell tesla sucks leased one model ys 2021 ...,nutshell tesla sucks leased model ys 2021 tech...,nutshell tesla suck leased model y 2021 techni...,-0.145,0.515833


In [60]:
df.columns

Index(['review', 'word_count', 'char_count', 'avg_word_len', 'stopword_count',
       'stopword_rate', 'review_lower', 'review_nopunc',
       'review_nopunc_nostop', 'clean_review', 'lemmatize_review', 'polarity',
       'subjectivity'],
      dtype='object')

In [61]:
df.drop(['review_lower', 'review_nopunc',
       'review_nopunc_nostop', 'clean_review', 'lemmatize_review'], axis=1, inplace=True)

In [63]:
df.sort_values(by='polarity')

Unnamed: 0,review,word_count,char_count,avg_word_len,stopword_count,stopword_rate,polarity,subjectivity
2,Really poor service. I took my car in to get t...,111,632,4.702703,49,2.265306,-0.166667,0.554321
4,In a nutshell: Tesla sucks! I leased one of th...,136,734,4.404412,65,2.092308,-0.145,0.515833
5,I waited for 25 mins and no one even acknowled...,59,278,3.728814,31,1.903226,-0.081818,0.45
6,Delivery and customer service experience is be...,51,296,4.823529,24,2.125,-0.075,0.728472
8,"Well, I had an issue with my Tesla. Took it in...",73,350,3.808219,38,1.921053,0.0,0.413333
0,Wow! The best tesla service center I have ever...,109,531,3.880734,58,1.87931,0.144444,0.315972
1,y'all cars ain't shit if it's able to hit anot...,35,193,4.514286,12,2.916667,0.15,0.7125
7,I took back my 2018 Model 3 last month for saf...,49,270,4.530612,15,3.266667,0.2,0.313333
3,Helena KElon Musk!Is climbing the highest moun...,125,745,4.96,48,2.604167,0.285,0.606667
9,Nick has been amazing in educating us about th...,83,459,4.542169,41,2.02439,0.502066,0.64022
