In [1]:
import pandas as pd
pd.set_option('display.max_colwidth', None)

In [2]:
# Load the raw scraped page text (one text fragment per row)
df = pd.read_csv("reviews_trustpilot.csv", index_col=None)
df.shape

(16114, 1)

In [3]:
df.head(20)

Unnamed: 0,0
0,1.2
1,
2,Visit this website
3,"5,389 total"
4,5-star
5,2%
6,4-star
7,<1%
8,3-star
9,<1%


In [4]:
df.tail(10)

Unnamed: 0,0
16104,"Date of experience: November 30, 2016"
16105,Claim your profile to access Trustpilot’s free business tools and connect with customers.
16106,Claim your profile to access Trustpilot’s free business tools and connect with customers.
16107,
16108,
16109,
16110,
16111,
16112,
16113,


In [5]:
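# Trim the page header (rows 0-15) and the trailing footer rows.
# .loc slices by label and includes both endpoints, so rows 16 through 16104 are kept.

df = df.loc[16:16104, :]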
df.shape

(16089, 1)
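
The bounds 16 and 16104 above were read off the head() and tail() output by eye. A minimal sketch of finding them programmatically follows, assuming (as in the rows shown here) that every review is immediately followed by a "Date of experience:" row. The helper name review_bounds and the toy frame are hypothetical, not part of the original notebook.

```python
import pandas as pd


def review_bounds(col: pd.Series, marker: str = "Date of experience:"):
    """Return (start_label, end_label) spanning the first review to the last date row.

    Assumption: the first review sits on the row directly before the first date row.
    """
    is_date = col.str.contains(marker, regex=False, na=False)
    date_labels = col.index[is_date]
    if date_labels.empty:
        raise ValueError("no date rows found")
    first_pos = col.index.get_loc(date_labels[0])  # position of the first date row
    start = col.index[max(first_pos - 1, 0)]       # the review just before it
    return start, date_labels[-1]


# Toy data shaped like the scraped file: header junk, reviews with dates, footer junk
toy = pd.DataFrame({"0": [
    "1.2", None, "Visit this website",
    "Great service", "Date of experience: June 08, 2023",
    "Slow delivery", "Date of experience: June 07, 2023",
    "Claim your profile", None,
]})

start, end = review_bounds(toy["0"])
trimmed = toy.loc[start:end, :]
print(start, end, trimmed.shape)
```

On the real file this should be checked against head() and tail() before trusting it, since a page header could in principle sit directly before the first date row.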

In [6]:
df.head(5)

Unnamed: 0,0
16,Signed up for MyMail just now. I had to use a different e-mail address because the recovery process for my long-forgotten e-Post username and password crashed. Created new profile under a different e-mail only to discover MyMail is not available for my address. COMPLETE CANADA POST APP-TESTING FAIL!!!
17,"Date of experience: June 08, 2023"
18,"Sent an iPhone expresspost to Philippines from Ontario. Came back 14 days later with sticker ✅ dangerous goods. I told the guy at the counter it was a used iPhone in original box, and he does not know it is classed as dangerous goods? If he does not know, how should I know? 🙄"
19,"Date of experience: June 08, 2023"
20,"Canada post out of order and is a reflection of our Liberal government who don’t give a dam about its citizens. Don’t forget, we sign your pay cheque. Do better."


In [7]:
df.tail(5)

Unnamed: 0,0
16100,"Date of experience: December 06, 2016"
16101,"Crown Corporation opt out: Important: We're working hard to process and deliver record holiday parcel volumes as quickly as possible. Please note that it may take up to 24 hours for customers to see tracking information on our website. In some cases, customers may also experience a delay in delivery. We continue to devote extra resources to serve you and apologize for any delays\n\nFull disclosure aside, why is it that a package dispatched from Mississauga to an address in Mississauga, a distance of maximum seven kilometres, was sent instead to Stoney Creek for sorting, a round trip of about 100 kilometres. The package has yet to be delivered over four days past scheduled delivery. Am I missing something subtle here? We all know that we would be sacked if we did something as stupid as this where we work.\nAnd talking of deliveries, a very expensive laptop was left on our porch as the Canada Post delivery man could not be bothered to even ring the doorbell. Perhaps the powers that be, occupying jobs for life, ought to monitor news reports of people cruising streets to steal parcels from porches."
16102,"Date of experience: December 03, 2016"
16103,"Canada post fails to deliver packages on time on the regular, spending time ""processing"" the package for much too long. When packages arrive, they often don't even knock and just leave their little paper slip. Government employees should not be in charge of mail, and Canada Post shows what happens when you let incompetence run the show."
16104,"Date of experience: November 30, 2016"


In [8]:
# Remove blank rows (missing values, i.e. NaN)

df = df.dropna(axis=0)
df.shape

(14137, 1)

In [9]:
# Locate the index labels of the star-rating summary rows ("1-star" to "5-star")

star_labels = ["1-star", "2-star", "3-star", "4-star", "5-star"]
star_index = df[df["0"].isin(star_labels)].index

In [10]:
# Drop the star-rating summary rows that repeat on each page

df = df.drop(star_index, axis=0)
df.shape

(12917, 1)

In [11]:
# Locate the "Date of experience" rows; the scrape is ordered newest first
str_dates = df[df["0"].str.contains('Date of experience:')]
dates_index = str_dates.index

print(f" Last review date:  {str_dates.head(1).values}\n First review date: {str_dates.tail(1).values}")

 Last review date:  [['Date of experience: June 08, 2023']]
 First review date: [['Date of experience: November 30, 2016']]


In [12]:
# Drop the date rows, keeping only the text rows

df = df.drop(dates_index, axis=0)
df.shape

(8017, 1)
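
The cell above discards the dates entirely. If they are wanted later (for example, to look at reviews over time), one option is to attach each date to its review before dropping the date rows. The sketch below assumes each "Date of experience:" row belongs to the text row(s) directly above it and that the index is unique; attach_dates and the column name date_of_experience are hypothetical names introduced for illustration.

```python
import pandas as pd

PREFIX = "Date of experience: "


def attach_dates(df: pd.DataFrame, col: str = "0") -> pd.DataFrame:
    """Drop the date rows but keep their value in a new datetime column."""
    is_date = df[col].str.startswith(PREFIX, na=False)
    # Parse strings such as "June 08, 2023" (English month names assumed)
    parsed = pd.to_datetime(df.loc[is_date, col].str[len(PREFIX):],
                            format="%B %d, %Y")
    out = df.loc[~is_date].copy()
    # Back-fill so every text row picks up the next date row below it
    out["date_of_experience"] = parsed.reindex(df.index).bfill().loc[out.index]
    return out


toy = pd.DataFrame(
    {"0": ["Review A", "Date of experience: June 08, 2023",
           "Review B", "Date of experience: June 07, 2023"]},
    index=[16, 17, 18, 19],
)
with_dates = attach_dates(toy)
print(with_dates)
```

Boilerplate rows that sit between a review and the next date would also receive a date under this scheme, which is harmless as long as they are dropped afterwards.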

In [13]:
# Find the most frequently repeated strings; these are page boilerplate, not reviews
df["0"].value_counts().head(11)

<1%    488
2%     488
Claim your profile to access Trustpilot’s free business tools and connect with customers.

In [14]:
# Exact-match boilerplate strings that repeat on every scraped page
boilerplate = [
    "<1%",
    "2%",
    "Claim your profile to access Trustpilot’s free business tools and connect with customers.",
    "See Trustpilot reviews directly in your Google searches",
    "Most recent",
    "Filter",
    "95%",
    "5,389 total",
    "Visit this website",
    "1.2",
    "We use cookies to personalize content and ads, to provide social media features, and to analyze our traffic. We also share information about your use of our site with our partners in social media, advertising, and analytics. By continuing to use our website, you accept the use of all cookies. You can always access and change your cookie preferences in the footer of this website.",
]
todrop_index = df[df["0"].isin(boilerplate)].index

In [15]:
df = df.drop(todrop_index, axis=0)
df.shape

(4830, 1)
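
Listing "<1%", "2%" and "95%" by hand works for this snapshot, but the percentages change whenever the rating distribution changes. A pattern match is one alternative; the sketch below assumes no genuine review consists solely of a percentage. drop_percent_rows and PERCENT_ROW are hypothetical names, not part of the original notebook.

```python
import pandas as pd

# An optional "<", one or more digits, then "%", and nothing else in the cell
PERCENT_ROW = r"<?\d+%"


def drop_percent_rows(df: pd.DataFrame, col: str = "0") -> pd.DataFrame:
    """Remove rows whose entire text is a percentage such as '2%' or '<1%'."""
    mask = df[col].str.fullmatch(PERCENT_ROW, na=False)
    return df.loc[~mask]


toy = pd.DataFrame({"0": ["<1%", "2%", "95%",
                          "Only 2% of my parcels arrive on time", "Great"]})
kept = drop_percent_rows(toy)
print(kept)
```

Because fullmatch requires the whole cell to match, a review that merely mentions a percentage is kept.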

In [16]:
df.sample(3)

Unnamed: 0,0
8564,"What is up with these guys? Expedited package, received a tracking with delivery of May 28 before end of day, mail delivery dor June 3 is here and the package still not delivered!!!!Now tracking says do not know when delivery will happen, it shows it was received in my city days and days ago. That's it, I will no longer use this service, not worth the money, they do not deliver"
643,"What is going on at Canada Post? My package was processed at their factory in Kitchener on February 14th. Since then it gets delayed by a day and since yesterday it's been ""in transit"" for the last 24 hours. Now their terrible tracking system just says ""delayed"" with no date in sight. I get delays happen since my last 2 orders were late by 2 days but this is getting ridiculous. I'm guessing I won't get my package until February 21st since this Monday is a holiday."
15142,REALLY CANADA POST????\nI sent a letter registered mail from Thunder Bay to Toronto - so it won't get lost (or hopefully not) on December 23rd and it says the expected delivery date is January 2nd?\nI could probably walk with the letter to Toronto and get it there faster.


In [17]:
# Checking the number of duplicated comments
df.duplicated().sum()

8
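
Before dropping duplicates it can be worth looking at which comments repeat, since a leftover boilerplate string would show up here with a high count. A small sketch on toy data (the frame below is made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({"0": ["Late again", "Lost my parcel", "Late again", "Fine"]})

# keep=False flags every copy of a repeated row, not only the later ones
repeated = toy[toy.duplicated(keep=False)]
counts = repeated["0"].value_counts()
print(counts)
```

Note that duplicated() with its default keep="first" counts only the extra copies, which is why it reports one fewer row per repeated comment than this view shows.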

In [18]:
df = df.drop_duplicates()
df.shape

(4822, 1)

In [19]:
# Save the cleaned reviews without the index column
df.to_csv('reviews_trustpilot_clean.csv', index=False)
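
Several reviews contain embedded newlines, commas and double quotes, so a quick round-trip check can confirm the saved file reads back unchanged. The sketch below uses an in-memory buffer and toy rows rather than the real file.

```python
import io

import pandas as pd

toy = pd.DataFrame({"0": [
    "Line one\nline two, with a comma",
    'Spending time "processing" the package',
]})

# Write to an in-memory buffer instead of disk, then read it straight back
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

round_trip_ok = restored["0"].tolist() == toy["0"].tolist()
print(round_trip_ok)
```

The same comparison can be run against reviews_trustpilot_clean.csv by reading it with pd.read_csv and comparing against df["0"].tolist().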