In [1]:
import json
import csv 
import tweepy
import re
from tqdm import tqdm
import snscrape.modules.twitter as sntwitter
import pandas as pd
import matplotlib.pyplot as plt 
import spacy 
import nltk 
from nltk.tokenize import word_tokenize
from bs4 import BeautifulSoup 
import html
from nltk.probability import FreqDist
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords 

In [None]:
#25-28 
#Creating list to append tweet data to 
tweets_list2 = []

#Using TwitterSearchScraper to scrape data and append tweets to list 
for i, tweet in tqdm(enumerate(sntwitter.TwitterSearchScraper(['#Uber since:2021-02-01 until:2021-07-28','#Lyft since:2021-02-01 until:2021-07-28']).get_items())):
    #Stop after ~100,000 scraped items, whatever their language
    if i>100000:
        break
    if tweet.lang=='en':
        tweets_list2.append([tweet.date,tweet.id,tweet.content,tweet.user.username,tweet.lang])
    
#Creating DataFrame from the tweets list above
tweets_df_hash_feb_july = pd.DataFrame(tweets_list2,columns=['Datetime','Tweet ID','Text','Username','Language'])
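
The loop above does two jobs at once: it filters by language and it caps how many items get scraped. Here is a minimal sketch of keeping those two concerns separate. `collect_english` is a hypothetical helper, and the `SimpleNamespace` objects are invented stand-ins for scraped tweets, not real snscrape items.

```python
from itertools import islice
from types import SimpleNamespace


def collect_english(items, max_items):
    """Return rows for the English tweets among the first `max_items` items.

    `items` is any iterable of objects exposing date, id, content,
    user.username and lang (hypothetical stand-ins for scraped tweets).
    """
    rows = []
    # islice stops after max_items items regardless of their language
    for tweet in islice(items, max_items):
        if tweet.lang == 'en':
            rows.append([tweet.date, tweet.id, tweet.content,
                         tweet.user.username, tweet.lang])
    return rows


# Stand-in data, invented for illustration only
fake_items = [
    SimpleNamespace(date='2021-07-27', id=1, content='hello #Uber',
                    user=SimpleNamespace(username='a'), lang='en'),
    SimpleNamespace(date='2021-07-27', id=2, content='hola #Uber',
                    user=SimpleNamespace(username='b'), lang='es'),
    SimpleNamespace(date='2021-07-26', id=3, content='bye #Lyft',
                    user=SimpleNamespace(username='c'), lang='en'),
]

rows = collect_english(fake_items, max_items=2)
```

With the cap applied to every item, the loop always ends after `max_items` items even when a long run of non-English tweets comes through.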

In [None]:
tweets_df_hash_feb_july

In [70]:
tweets_df_hash_feb_july.to_csv('Uber_Lyft_Feb_to_July.csv')

In [71]:
df = tweets_df_hash_feb_july

# Exploratory Data Analysis 

Let's start by taking a look at the users with the most tweets in the dataset.

In [87]:
df['Username'].value_counts().head(15)


RadioRideshare     5155
Cab4Now             546
_UberRealEstate     437
Jamyies             305
Copenhagen_bear     235
djt1940             213
Emmonspired         188
RideSafeWorld       122
servicesdown_       110
PaulDDDaughters     108
sharerepurchase     101
BRAVENEWEUROPE1      99
LoneStarSUVLimo      86
best_referral        85
LTDAtaxinews         83
Name: Username, dtype: int64

I looked up what each of these accounts is for and whether it makes sense to use them for this project:

* RadioRideshare has the most tweets in the dataset, and it looks like it's just an account that posts links to a radio stream for rideshare drivers to play. It only tweets about the current song playing and would not provide much insight into either company's reputation. Seems okay to drop this from the dataset 

* Cab4Now looks like it is a chauffeur competitor for Lyft and Uber, I want to look closer at their actual tweets about both to make a final decision 

* _UberRealEstate is a real estate company that is conveniently using the Uber hashtag to promote itself; this can definitely be dropped from the dataset. 

* Jamyies is a "stock guru" on twitter and the first couple of tweets I see that have either rideshare service tagged are just a mass block of hashtags. I want to look closer at their actual tweets about both to make a final decision 

* Copenhagen_bear is a driver for Uber and his tweets are generally about his experiences with Uber -- will definitely be keeping these in the dataset

* djt1940 looks like they actually do post quite a bit in terms of reviews and potential controversy -- will be keeping these in the dataset 

* Emmonspired is another uber/lyft driver that does tweet about their firsthand experience with working for them -- will be keeping in dataset 

* RideSafeWorld looks like something that reviews rideshare companies -- will be keeping these in the dataset 

* servicesdown_ just tweets when ... services... are down... so they are not contributing much to the conversation. Will drop these from the dataset 

* PaulDDDaughters tweets about controversial issues regarding Uber's business practices. Will be keeping these in the dataset 

* sharerepurchase tweets the exact same thing over and over again about Lyft arriving in Fort Myers - no reason to keep these 



_Note: I understand that there are probably plenty of tweets within this dataset that are not directly relevant to our overall goal here. I only chose to focus on the accounts listed above because of how many tweets these accounts make up within this dataset. I doubt that the average reviewer would have hundreds of tweets about either Lyft or Uber if they were simply commenting on their experience._
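
To put a number on how much of the dataset a handful of accounts make up, `value_counts(normalize=True)` gives each account's share alongside its raw count. This is a minimal sketch on invented toy data; the usernames are placeholders, not accounts from the scrape.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({'Username': ['bot', 'bot', 'bot', 'driver', 'rider']})

# Raw tweet counts per account
counts = toy['Username'].value_counts()

# Same counts expressed as a fraction of all tweets
share = toy['Username'].value_counts(normalize=True)

# Side-by-side view: both Series share the same index (the usernames)
summary = pd.DataFrame({'tweets': counts, 'share': share})
```

Sorting `summary` by `share` makes it easier to decide where to stop reviewing accounts by hand.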

In [75]:
 #Taking a closer look at Cab4Now as mentioned above 
    
df.loc[df['Username']=='Cab4Now']

Unnamed: 0,Datetime,Tweet ID,Text,Username,Language
414,2021-07-24 19:14:07+00:00,1419013039219695617,Engineer who stole trade secrets from Google a...,Cab4Now,en
617,2021-07-23 07:13:31+00:00,1418469309387612161,Uber Eats rider died riding e-bike not approve...,Cab4Now,en
725,2021-07-22 13:13:39+00:00,1418197551786962944,Engineer who stole trade secrets from Google a...,Cab4Now,en
823,2021-07-21 19:14:04+00:00,1417925863564251141,Workers are again learning the power of collec...,Cab4Now,en
911,2021-07-21 01:14:12+00:00,1417654107972804611,The Guardian view on public sector jobs: keep ...,Cab4Now,en
...,...,...,...,...,...
32800,2021-02-01 20:07:35+00:00,1356333393936572417,Is this the end for the gig economy? | Aaron B...,Cab4Now,en
32832,2021-02-01 15:31:23+00:00,1356263888929628165,California Uber and Lyft drivers brace for shu...,Cab4Now,en
32870,2021-02-01 10:53:11+00:00,1356193877523390464,Limo Company To Pay $1.6M In Class Action Wage...,Cab4Now,en
32891,2021-02-01 06:17:45+00:00,1356124560719048707,"Judge grants Uber and Lyft temporary stay, ave...",Cab4Now,en


* Cab4Now: I wouldn't say that all of these tweets are necessarily the kinds of tweets we would be looking for in a project like this, but quite a few are, so I will keep these in the dataset

In [76]:
# Taking a closer look at Jamyies as mentioned above 

df.loc[df['Username']=='Jamyies']

Unnamed: 0,Datetime,Tweet ID,Text,Username,Language
91,2021-07-27 11:21:35+00:00,1419981288820273160,BITCOIN GETTING READY FOR ANOTHER BREAKOUT ⁉️\...,Jamyies,en
270,2021-07-26 01:26:10+00:00,1419469057032560640,"BITCOIN HIT MY TARGET OF $39,000 FROM YESTERDA...",Jamyies,en
334,2021-07-25 13:12:05+00:00,1419284319583883270,CONGRATULATIONS STOCK GURUS ON BITCOIN BUYS‼️ ...,Jamyies,en
360,2021-07-25 07:52:40+00:00,1419203934195838976,What a buy on Bitcoin 4 days ago 🎉 \n\nFollow ...,Jamyies,en
392,2021-07-25 00:59:14+00:00,1419099890261499905,"SATURDAY NIGHT LIVE AMA: #STOCKS, $BTC, $ETH I...",Jamyies,en
...,...,...,...,...,...
32781,2021-02-02 00:10:44+00:00,1356394585614766081,1 HOUR UNTIL I CHOOSE THE STUDENT WHO WINS $50...,Jamyies,en
32897,2021-02-01 05:38:46+00:00,1356114748975194114,Spa away from home spa.\n\nGet a message a day...,Jamyies,en
32906,2021-02-01 02:51:40+00:00,1356072696077926408,Thank you all for another Amazing Live chat wi...,Jamyies,en
32915,2021-02-01 00:57:02+00:00,1356043851459227650,LETS GO LIVE!!!!!!! IN 15 MINUTES!!\n\nJOIN TH...,Jamyies,en


* Jamyies posts quite a bit about things that are not related to Uber or Lyft at all; it looks like he is only tagging Uber to try to get more recognition or interactions on Twitter. Can drop these from the dataset. 

#### Usernames To be dropped:  

    * RadioRideshare
    * _UberRealEstate
    * servicesdown_
    * sharerepurchase 
    * Jamyies 

In [128]:
# Dropping tweets from the aforementioned usernames 

users_to_drop = ['RadioRideshare', '_UberRealEstate', 'servicesdown_',
                 'sharerepurchase', 'Jamyies']

# Keep only the rows whose Username is not in the drop list
df = df[~df['Username'].isin(users_to_drop)]

df

Unnamed: 0,Datetime,Tweet ID,Text,Username,Language
0,2021-07-27 23:23:08+00:00,1420162870101241856,@Uber_Canada #uber so you tell me I have a dis...,berthorny,en
1,2021-07-27 23:02:19+00:00,1420157632476745728,Life in prison for man in killing of South Car...,upstractcom,en
2,2021-07-27 23:00:52+00:00,1420157266658054151,"“Following the pandemic-led lockdowns, America...",badgerinstitute,en
3,2021-07-27 22:50:30+00:00,1420154657696100364,Great security at @iflymia @DHSgov with no des...,RafaelAntun,en
4,2021-07-27 22:39:35+00:00,1420151910905040896,@SkyNews Can govt start policing companies tha...,peaceandprotect,en
...,...,...,...,...,...
32911,2021-02-01 02:08:26+00:00,1356061819576725508,👉 Win a $100 #UBER gift card 🚘 🤑 Enter here: h...,dataentrytard3,en
32912,2021-02-01 01:42:33+00:00,1356055305872936960,Are gig economy disrupters finally running out...,Cab4Now,en
32913,2021-02-01 01:18:51+00:00,1356049337965490178,The $10 monthly Uber credit from the Amex Gold...,SmartLivingSci,en
32917,2021-02-01 00:24:18+00:00,1356035612218970114,@UberEats Is this seriously how you guys want ...,GetStoked_On_It,en


In [129]:
# Want to make sure there are no repeating TweetIDs 

df['Tweet ID'].value_counts().head()

1420162870101241856    1
1371891322181857281    1
1371933680227258374    1
1371933714184290313    1
1371933747826860035    1
Name: Tweet ID, dtype: int64

Doesn't look like there are any repeating Tweet IDs which means all tweets in the dataset are in fact unique.
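
The same check can be done without reading the top of `value_counts()` by eye. A minimal sketch on invented IDs, using `Series.is_unique` and `Series.duplicated()`:

```python
import pandas as pd

# Toy IDs, invented for illustration only; 102 appears twice
toy = pd.DataFrame({'Tweet ID': [101, 102, 103, 102]})

# True only when every ID appears exactly once
all_unique = toy['Tweet ID'].is_unique

# Number of rows that repeat an ID already seen earlier in the column
n_repeated_rows = int(toy['Tweet ID'].duplicated().sum())
```

`n_repeated_rows` being 0 is equivalent to `all_unique` being `True`.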

* I want to look through the data to see the number of instances for each ridesharing company and compare the number of mentions. Ultimately looking to see how balanced our data is before we move forward. 


In [130]:
Uber = df.loc[df['Text'].str.contains('Uber',case=False)]
Uber

Unnamed: 0,Datetime,Tweet ID,Text,Username,Language
0,2021-07-27 23:23:08+00:00,1420162870101241856,@Uber_Canada #uber so you tell me I have a dis...,berthorny,en
1,2021-07-27 23:02:19+00:00,1420157632476745728,Life in prison for man in killing of South Car...,upstractcom,en
2,2021-07-27 23:00:52+00:00,1420157266658054151,"“Following the pandemic-led lockdowns, America...",badgerinstitute,en
3,2021-07-27 22:50:30+00:00,1420154657696100364,Great security at @iflymia @DHSgov with no des...,RafaelAntun,en
4,2021-07-27 22:39:35+00:00,1420151910905040896,@SkyNews Can govt start policing companies tha...,peaceandprotect,en
...,...,...,...,...,...
32911,2021-02-01 02:08:26+00:00,1356061819576725508,👉 Win a $100 #UBER gift card 🚘 🤑 Enter here: h...,dataentrytard3,en
32912,2021-02-01 01:42:33+00:00,1356055305872936960,Are gig economy disrupters finally running out...,Cab4Now,en
32913,2021-02-01 01:18:51+00:00,1356049337965490178,The $10 monthly Uber credit from the Amex Gold...,SmartLivingSci,en
32917,2021-02-01 00:24:18+00:00,1356035612218970114,@UberEats Is this seriously how you guys want ...,GetStoked_On_It,en


In [131]:
Uber_hash = df.loc[df['Text'].str.contains('#Uber',case=False)]
Uber_hash

Unnamed: 0,Datetime,Tweet ID,Text,Username,Language
0,2021-07-27 23:23:08+00:00,1420162870101241856,@Uber_Canada #uber so you tell me I have a dis...,berthorny,en
1,2021-07-27 23:02:19+00:00,1420157632476745728,Life in prison for man in killing of South Car...,upstractcom,en
2,2021-07-27 23:00:52+00:00,1420157266658054151,"“Following the pandemic-led lockdowns, America...",badgerinstitute,en
3,2021-07-27 22:50:30+00:00,1420154657696100364,Great security at @iflymia @DHSgov with no des...,RafaelAntun,en
4,2021-07-27 22:39:35+00:00,1420151910905040896,@SkyNews Can govt start policing companies tha...,peaceandprotect,en
...,...,...,...,...,...
32911,2021-02-01 02:08:26+00:00,1356061819576725508,👉 Win a $100 #UBER gift card 🚘 🤑 Enter here: h...,dataentrytard3,en
32912,2021-02-01 01:42:33+00:00,1356055305872936960,Are gig economy disrupters finally running out...,Cab4Now,en
32913,2021-02-01 01:18:51+00:00,1356049337965490178,The $10 monthly Uber credit from the Amex Gold...,SmartLivingSci,en
32917,2021-02-01 00:24:18+00:00,1356035612218970114,@UberEats Is this seriously how you guys want ...,GetStoked_On_It,en


In [132]:
Lyft = df.loc[df['Text'].str.contains('Lyft',case=False)]
Lyft

Unnamed: 0,Datetime,Tweet ID,Text,Username,Language
3,2021-07-27 22:50:30+00:00,1420154657696100364,Great security at @iflymia @DHSgov with no des...,RafaelAntun,en
6,2021-07-27 22:18:04+00:00,1420146498499719169,@CTVNews This is a warming for everyone who us...,BadUberX,en
7,2021-07-27 21:49:17+00:00,1420139254928248832,Should we ban #ridesharing companies? https://...,JohnKinyuaKE,en
10,2021-07-27 21:28:44+00:00,1420134081564708867,Research reveals that @Uber #RideSharing servi...,SharpPlaysGroup,en
12,2021-07-27 21:17:26+00:00,1420131238887628803,Is it cheaper to take Uber or Lyft in Los Ange...,StevenMSweat,en
...,...,...,...,...,...
32888,2021-02-01 07:16:08+00:00,1356139254754267138,Do you #drive #UBER or #LYFT Do not get caught...,AnswersRide,en
32889,2021-02-01 07:06:42+00:00,1356136880140644353,@GordonJohnson19 @PunchableFaceVI Tesla Police...,Donald66073620,en
32891,2021-02-01 06:17:45+00:00,1356124560719048707,"Judge grants Uber and Lyft temporary stay, ave...",Cab4Now,en
32904,2021-02-01 02:59:43+00:00,1356074724942462976,"5 years ago, #Uber insisted that it would redu...",BrentToderian,en


In [133]:
Lyft_hash = df.loc[df['Text'].str.contains('#Lyft',case=False)]
Lyft_hash

Unnamed: 0,Datetime,Tweet ID,Text,Username,Language
3,2021-07-27 22:50:30+00:00,1420154657696100364,Great security at @iflymia @DHSgov with no des...,RafaelAntun,en
6,2021-07-27 22:18:04+00:00,1420146498499719169,@CTVNews This is a warming for everyone who us...,BadUberX,en
7,2021-07-27 21:49:17+00:00,1420139254928248832,Should we ban #ridesharing companies? https://...,JohnKinyuaKE,en
12,2021-07-27 21:17:26+00:00,1420131238887628803,Is it cheaper to take Uber or Lyft in Los Ange...,StevenMSweat,en
16,2021-07-27 20:45:55+00:00,1420123308192931840,The @OFLabour laying out what’s at stake for #...,ridefairTO,en
...,...,...,...,...,...
32839,2021-02-01 14:31:16+00:00,1356248756166004743,Airports have been hit by the rise of #Uber an...,vahelpers,en
32848,2021-02-01 13:30:22+00:00,1356233433991806978,Airports have been hit by the rise of #Uber an...,vahelpers,en
32888,2021-02-01 07:16:08+00:00,1356139254754267138,Do you #drive #UBER or #LYFT Do not get caught...,AnswersRide,en
32889,2021-02-01 07:06:42+00:00,1356136880140644353,@GordonJohnson19 @PunchableFaceVI Tesla Police...,Donald66073620,en


It looks like there are many more instances of Uber being mentioned in our dataset, so I could scrape for more data to get more mentions of Lyft. It wouldn't be a problem to have our data look like this if we were only looking at positive and negative reviews, but since I want to compare the two companies' reputations I think it makes sense to have a more balanced dataset. 
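
Rather than comparing the filtered dataframes by eye, the balance can be summarized as counts. This is a minimal sketch on invented tweets; it also splits out tweets that mention both companies, since those show up in both the `Uber` and `Lyft` filters above.

```python
import pandas as pd

# Toy tweets, invented for illustration only
toy = pd.DataFrame({'Text': [
    '#Uber was late',
    'Uber or Lyft?',
    'love my #lyft driver',
    'UBER surge again',
]})

# Case-insensitive boolean masks, same idea as the filters above
mentions_uber = toy['Text'].str.contains('uber', case=False)
mentions_lyft = toy['Text'].str.contains('lyft', case=False)

balance = pd.Series({
    'Uber only': int((mentions_uber & ~mentions_lyft).sum()),
    'Lyft only': int((mentions_lyft & ~mentions_uber).sum()),
    'Both': int((mentions_uber & mentions_lyft).sum()),
})
```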

In [139]:
#25-28 
#Creating list to append tweet data to 
tweets_list2 = []

#Using TwitterSearchScraper to scrape data and append tweets to list 
for i, tweet in tqdm(enumerate(sntwitter.TwitterSearchScraper(['#Lyft since:2021-01-01 until:2021-07-28']).get_items())):
    #Stop after ~100,000 scraped items, whatever their language
    if i>100000:
        break
    if tweet.lang=='en':
        tweets_list2.append([tweet.date,tweet.id,tweet.content,tweet.user.username,tweet.lang])
    
#Creating DataFrame from the tweets list above
tweets_df_hash_feb_july_Lyft_only = pd.DataFrame(tweets_list2,columns=['Datetime','Tweet ID','Text','Username','Language'])

13672it [04:11, 54.46it/s]


In [140]:
df_Lyft_only = tweets_df_hash_feb_july_Lyft_only
df_Lyft_only

Unnamed: 0,Datetime,Tweet ID,Text,Username,Language
0,2021-07-27 23:25:44+00:00,1420163525134233605,"@lyft, hey I missed my flight today. I reserv...",espoproducer,en
1,2021-07-27 22:50:30+00:00,1420154657696100364,Great security at @iflymia @DHSgov with no des...,RafaelAntun,en
2,2021-07-27 22:45:20+00:00,1420153357675634688,Self-driving rides with safety drivers will be...,rbyatt,en
3,2021-07-27 22:18:04+00:00,1420146498499719169,@CTVNews This is a warming for everyone who us...,BadUberX,en
4,2021-07-27 21:49:17+00:00,1420139254928248832,Should we ban #ridesharing companies? https://...,JohnKinyuaKE,en
...,...,...,...,...,...
12551,2021-01-01 01:30:57+00:00,1344818359574163456,Please don’t drink and drive! Think about it! ...,jericka_w,en
12552,2021-01-01 01:04:59+00:00,1344811828564664320,"For $5 in ride credit, download the Lyft app u...",CassiopeiaBQ,en
12553,2021-01-01 00:43:02+00:00,1344806301168242689,Car check! Make it a habit to check all these ...,App_ADITT,en
12554,2021-01-01 00:18:34+00:00,1344800145595105282,"Requested a #Lyft, #Uber, or #Taxi through an ...",RideSafeWorld,en


* If I start the scrape in February instead of January, there are about 1000 fewer datapoints (and this is before cleaning the dataset at all)
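
One way to check how many rows an earlier start date adds, without re-running the scrape, is to count rows on either side of a cutoff. A minimal sketch on invented timestamps; the February 1 cutoff mirrors the first scrape's `since:` date.

```python
import pandas as pd

# Toy timestamps, invented for illustration only
toy = pd.DataFrame({'Datetime': [
    '2021-01-15 10:00:00+00:00',
    '2021-02-03 09:30:00+00:00',
    '2021-07-27 23:25:44+00:00',
]})

# Parse to timezone-aware timestamps so they compare cleanly with the cutoff
toy['Datetime'] = pd.to_datetime(toy['Datetime'], utc=True)
cutoff = pd.Timestamp('2021-02-01', tz='UTC')

n_before_feb = int((toy['Datetime'] < cutoff).sum())
n_from_feb = int((toy['Datetime'] >= cutoff).sum())
```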

In [142]:
df_Lyft_only.to_csv('Jan_to_Feb_Lyft_only.csv')

In [141]:
df_Lyft_only['Username'].value_counts().head(15)

RadioRideshare     5154
Emmonspired         162
AllNaturalPics      155
sharerepurchase     129
LaughOutNOW         113
gigeconpodcast       78
willgriesmer         73
LoneStarSUVLimo      65
vahelpers            62
ChoicesMatter_       55
_Long_n_Short        52
FreewayUKIns         51
PINNICOCARS          47
MarcoDaCostaFX       45
best_referral        42
Name: Username, dtype: int64

* We already know that RadioRideshare needs to be dropped (This is almost half of our dataset...) 

* AllNaturalPics needs to be dropped -- they tweet the same thing over and over again and it is always a voucher code 

* sharerepurchase needs to be dropped 

* LaughOutNOW needs to be dropped -- tags lyft but does not talk about lyft in their tweets 



In [143]:
lyft_users_to_drop = ['RadioRideshare', 'AllNaturalPics', 'sharerepurchase', 'LaughOutNOW']

# Keep only the rows whose Username is not in the drop list
df_Lyft_only = df_Lyft_only[~df_Lyft_only['Username'].isin(lyft_users_to_drop)]

df_Lyft_only


Unnamed: 0,Datetime,Tweet ID,Text,Username,Language
0,2021-07-27 23:25:44+00:00,1420163525134233605,"@lyft, hey I missed my flight today. I reserv...",espoproducer,en
1,2021-07-27 22:50:30+00:00,1420154657696100364,Great security at @iflymia @DHSgov with no des...,RafaelAntun,en
2,2021-07-27 22:45:20+00:00,1420153357675634688,Self-driving rides with safety drivers will be...,rbyatt,en
3,2021-07-27 22:18:04+00:00,1420146498499719169,@CTVNews This is a warming for everyone who us...,BadUberX,en
4,2021-07-27 21:49:17+00:00,1420139254928248832,Should we ban #ridesharing companies? https://...,JohnKinyuaKE,en
...,...,...,...,...,...
12551,2021-01-01 01:30:57+00:00,1344818359574163456,Please don’t drink and drive! Think about it! ...,jericka_w,en
12552,2021-01-01 01:04:59+00:00,1344811828564664320,"For $5 in ride credit, download the Lyft app u...",CassiopeiaBQ,en
12553,2021-01-01 00:43:02+00:00,1344806301168242689,Car check! Make it a habit to check all these ...,App_ADITT,en
12554,2021-01-01 00:18:34+00:00,1344800145595105282,"Requested a #Lyft, #Uber, or #Taxi through an ...",RideSafeWorld,en


In [145]:
df = pd.concat([df,df_Lyft_only])
df

Unnamed: 0,Datetime,Tweet ID,Text,Username,Language
0,2021-07-27 23:23:08+00:00,1420162870101241856,@Uber_Canada #uber so you tell me I have a dis...,berthorny,en
1,2021-07-27 23:02:19+00:00,1420157632476745728,Life in prison for man in killing of South Car...,upstractcom,en
2,2021-07-27 23:00:52+00:00,1420157266658054151,"“Following the pandemic-led lockdowns, America...",badgerinstitute,en
3,2021-07-27 22:50:30+00:00,1420154657696100364,Great security at @iflymia @DHSgov with no des...,RafaelAntun,en
4,2021-07-27 22:39:35+00:00,1420151910905040896,@SkyNews Can govt start policing companies tha...,peaceandprotect,en
...,...,...,...,...,...
12551,2021-01-01 01:30:57+00:00,1344818359574163456,Please don’t drink and drive! Think about it! ...,jericka_w,en
12552,2021-01-01 01:04:59+00:00,1344811828564664320,"For $5 in ride credit, download the Lyft app u...",CassiopeiaBQ,en
12553,2021-01-01 00:43:02+00:00,1344806301168242689,Car check! Make it a habit to check all these ...,App_ADITT,en
12554,2021-01-01 00:18:34+00:00,1344800145595105282,"Requested a #Lyft, #Uber, or #Taxi through an ...",RideSafeWorld,en


Since I am joining these dataframes, I need to check for duplicates and drop any that I find.

In [151]:
df['Tweet ID'].value_counts()

1387463622926888961    3
1377319645682622464    3
1371930555818856449    3
1377297299697766405    3
1413184101906006016    3
                      ..
1387539891542663171    1
1387540239778996230    1
1387540394603286531    1
1387540773961273346    1
1376502807436664832    1
Name: Tweet ID, Length: 30414, dtype: int64

In [157]:
df.drop_duplicates(inplace=True)
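
`drop_duplicates()` with no arguments only removes rows that match in every column. A minimal sketch, on invented data, of keying the de-duplication on `Tweet ID` alone and resetting the index afterwards (after `pd.concat` the index labels from the two dataframes overlap):

```python
import pandas as pd

# Toy frames, invented for illustration only; Tweet ID 3 is in both
first = pd.DataFrame({'Tweet ID': [1, 2, 3], 'Text': ['a', 'b', 'c']})
second = pd.DataFrame({'Tweet ID': [3, 4], 'Text': ['c', 'd']})

combined = pd.concat([first, second])

# Keep the first row seen for each Tweet ID, then renumber the rows 0..n-1
deduped = (combined
           .drop_duplicates(subset='Tweet ID', keep='first')
           .reset_index(drop=True))
```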

In [162]:
#let's see if we have any null values 
df.isnull().sum()

Datetime    0
Tweet ID    0
Text        0
Username    0
Language    0
dtype: int64

In [158]:
Lyft = df.loc[df['Text'].str.contains('Lyft',case=False)]
Lyft

Unnamed: 0,Datetime,Tweet ID,Text,Username,Language
3,2021-07-27 22:50:30+00:00,1420154657696100364,Great security at @iflymia @DHSgov with no des...,RafaelAntun,en
6,2021-07-27 22:18:04+00:00,1420146498499719169,@CTVNews This is a warming for everyone who us...,BadUberX,en
7,2021-07-27 21:49:17+00:00,1420139254928248832,Should we ban #ridesharing companies? https://...,JohnKinyuaKE,en
10,2021-07-27 21:28:44+00:00,1420134081564708867,Research reveals that @Uber #RideSharing servi...,SharpPlaysGroup,en
12,2021-07-27 21:17:26+00:00,1420131238887628803,Is it cheaper to take Uber or Lyft in Los Ange...,StevenMSweat,en
...,...,...,...,...,...
12551,2021-01-01 01:30:57+00:00,1344818359574163456,Please don’t drink and drive! Think about it! ...,jericka_w,en
12552,2021-01-01 01:04:59+00:00,1344811828564664320,"For $5 in ride credit, download the Lyft app u...",CassiopeiaBQ,en
12553,2021-01-01 00:43:02+00:00,1344806301168242689,Car check! Make it a habit to check all these ...,App_ADITT,en
12554,2021-01-01 00:18:34+00:00,1344800145595105282,"Requested a #Lyft, #Uber, or #Taxi through an ...",RideSafeWorld,en


In [159]:
Uber = df.loc[df['Text'].str.contains('Uber',case=False)]
Uber

Unnamed: 0,Datetime,Tweet ID,Text,Username,Language
0,2021-07-27 23:23:08+00:00,1420162870101241856,@Uber_Canada #uber so you tell me I have a dis...,berthorny,en
1,2021-07-27 23:02:19+00:00,1420157632476745728,Life in prison for man in killing of South Car...,upstractcom,en
2,2021-07-27 23:00:52+00:00,1420157266658054151,"“Following the pandemic-led lockdowns, America...",badgerinstitute,en
3,2021-07-27 22:50:30+00:00,1420154657696100364,Great security at @iflymia @DHSgov with no des...,RafaelAntun,en
4,2021-07-27 22:39:35+00:00,1420151910905040896,@SkyNews Can govt start policing companies tha...,peaceandprotect,en
...,...,...,...,...,...
12550,2021-01-01 01:45:51+00:00,1344822112511533056,Head of Uber Eats zoom call.\n\n#UberEats adds...,KC15509358,en
12551,2021-01-01 01:30:57+00:00,1344818359574163456,Please don’t drink and drive! Think about it! ...,jericka_w,en
12552,2021-01-01 01:04:59+00:00,1344811828564664320,"For $5 in ride credit, download the Lyft app u...",CassiopeiaBQ,en
12553,2021-01-01 00:43:02+00:00,1344806301168242689,Car check! Make it a habit to check all these ...,App_ADITT,en


There are many more tweets for Uber than there are for Lyft. My immediate thought was that this might mean Lyft is less problematic than Uber, but after some research I discovered that Lyft only operates in the U.S. and Canada, so it has a much smaller customer pool. 

* Now that I have the dataset cleaned up, I'm not sure that I need the Tweet ID, Username, or Language columns?  

### Further text processing in an attempt to get better results

In [2]:
# Let's try to clean up our text data a little bit more 
df = pd.read_csv('/Users/jchap/Desktop/Cap3_Sentiment_Analysis_Project/Data/Cleaned_dataset_v1.csv')

In [3]:
df.drop(columns=['Unnamed: 0','Tweet ID','Language'],inplace=True)

In [4]:
df.dropna(inplace=True)
df.reset_index(inplace=True)

In [5]:
df.isnull().sum()

index       0
Datetime    0
Text        0
Username    0
dtype: int64

In [6]:
#converting html character entities (e.g. &amp;) back to plain characters 
df.Text = df.Text.apply(html.unescape)

In [7]:
#dropping https links from the dataframe (regex=True is required for the pattern to be treated as a regex)
df['Text'] = df['Text'].str.replace(r'http\S+','',regex=True)


In [8]:
#getting rid of newline characters 
df['Text'] = df['Text'].str.replace('\n','')
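
The three cleaning cells above can also be written as one function and applied in a single pass. This is a minimal sketch using only the standard library; `clean_tweet` is a hypothetical helper name. Unlike the cell above, it replaces each newline with a space rather than deleting it, which is an assumption on my part so that words on either side of a line break are not glued together.

```python
import html
import re

URL_PATTERN = re.compile(r'http\S+')


def clean_tweet(text):
    """Unescape HTML entities, drop links and flatten newlines."""
    text = html.unescape(text)          # '&amp;' -> '&'
    text = URL_PATTERN.sub('', text)    # remove http/https links
    text = text.replace('\n', ' ')      # newline -> space (assumption, see above)
    # Collapse the runs of spaces left behind by the two steps above
    return re.sub(r' {2,}', ' ', text).strip()


# Invented example tweet
example = clean_tweet('Uber &amp; Lyft\n\nsurge again https://t.co/abc123')
```

On a dataframe this would be applied with `df['Text'].apply(clean_tweet)`.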

I want to try dropping hashtags. I'm not sure these will make much of a difference, since Vader Sentiment is supposed to be able to handle text like that, so I am going to create two different dataframes (one with hashtags and one without) and see if I notice a difference after labeling. 

In [9]:
df_nohash = df.copy()

In [11]:
df_nohash['Text'] = df_nohash['Text'].str.replace('#','')

In [12]:
df_nohash['Text'] = df_nohash['Text'].str.replace('@','')

In [13]:
df_nohash

Unnamed: 0,index,Datetime,Text,Username
0,0,2021-07-27 23:23:08+00:00,Uber_Canada uber so you tell me I have a disco...,berthorny
1,1,2021-07-27 23:02:19+00:00,Life in prison for man in killing of South Car...,upstractcom
2,2,2021-07-27 23:00:52+00:00,"“Following the pandemic-led lockdowns, America...",badgerinstitute
3,3,2021-07-27 22:50:30+00:00,Great security at iflymia DHSgov with no desig...,RafaelAntun
4,4,2021-07-27 22:39:35+00:00,SkyNews Can govt start policing companies that...,peaceandprotect
...,...,...,...,...
30368,30491,2021-01-01 01:30:57+00:00,Please don’t drink and drive! Think about it! ...,jericka_w
30369,30492,2021-01-01 01:04:59+00:00,"For $5 in ride credit, download the Lyft app u...",CassiopeiaBQ
30370,30493,2021-01-01 00:43:02+00:00,Car check! Make it a habit to check all these ...,App_ADITT
30371,30494,2021-01-01 00:18:34+00:00,"Requested a Lyft, Uber, or Taxi through an app?",RideSafeWorld


In [14]:
df

Unnamed: 0,index,Datetime,Text,Username
0,0,2021-07-27 23:23:08+00:00,@Uber_Canada #uber so you tell me I have a dis...,berthorny
1,1,2021-07-27 23:02:19+00:00,Life in prison for man in killing of South Car...,upstractcom
2,2,2021-07-27 23:00:52+00:00,"“Following the pandemic-led lockdowns, America...",badgerinstitute
3,3,2021-07-27 22:50:30+00:00,Great security at @iflymia @DHSgov with no des...,RafaelAntun
4,4,2021-07-27 22:39:35+00:00,@SkyNews Can govt start policing companies tha...,peaceandprotect
...,...,...,...,...
30368,30491,2021-01-01 01:30:57+00:00,Please don’t drink and drive! Think about it! ...,jericka_w
30369,30492,2021-01-01 01:04:59+00:00,"For $5 in ride credit, download the Lyft app u...",CassiopeiaBQ
30370,30493,2021-01-01 00:43:02+00:00,Car check! Make it a habit to check all these ...,App_ADITT
30371,30494,2021-01-01 00:18:34+00:00,"Requested a #Lyft, #Uber, or #Taxi through an ...",RideSafeWorld


Want to work on Tokenization, Stemming, and Lemmatization. 

In [15]:
df.Text = df.Text.apply(word_tokenize)

In [16]:
df_nohash.Text = df_nohash.Text.apply(word_tokenize)

In [17]:
df

Unnamed: 0,index,Datetime,Text,Username
0,0,2021-07-27 23:23:08+00:00,"[@, Uber_Canada, #, uber, so, you, tell, me, I...",berthorny
1,1,2021-07-27 23:02:19+00:00,"[Life, in, prison, for, man, in, killing, of, ...",upstractcom
2,2,2021-07-27 23:00:52+00:00,"[“, Following, the, pandemic-led, lockdowns, ,...",badgerinstitute
3,3,2021-07-27 22:50:30+00:00,"[Great, security, at, @, iflymia, @, DHSgov, w...",RafaelAntun
4,4,2021-07-27 22:39:35+00:00,"[@, SkyNews, Can, govt, start, policing, compa...",peaceandprotect
...,...,...,...,...
30368,30491,2021-01-01 01:30:57+00:00,"[Please, don, ’, t, drink, and, drive, !, Thin...",jericka_w
30369,30492,2021-01-01 01:04:59+00:00,"[For, $, 5, in, ride, credit, ,, download, the...",CassiopeiaBQ
30370,30493,2021-01-01 00:43:02+00:00,"[Car, check, !, Make, it, a, habit, to, check,...",App_ADITT
30371,30494,2021-01-01 00:18:34+00:00,"[Requested, a, #, Lyft, ,, #, Uber, ,, or, #, ...",RideSafeWorld


In [18]:
df_nohash

Unnamed: 0,index,Datetime,Text,Username
0,0,2021-07-27 23:23:08+00:00,"[Uber_Canada, uber, so, you, tell, me, I, have...",berthorny
1,1,2021-07-27 23:02:19+00:00,"[Life, in, prison, for, man, in, killing, of, ...",upstractcom
2,2,2021-07-27 23:00:52+00:00,"[“, Following, the, pandemic-led, lockdowns, ,...",badgerinstitute
3,3,2021-07-27 22:50:30+00:00,"[Great, security, at, iflymia, DHSgov, with, n...",RafaelAntun
4,4,2021-07-27 22:39:35+00:00,"[SkyNews, Can, govt, start, policing, companie...",peaceandprotect
...,...,...,...,...
30368,30491,2021-01-01 01:30:57+00:00,"[Please, don, ’, t, drink, and, drive, !, Thin...",jericka_w
30369,30492,2021-01-01 01:04:59+00:00,"[For, $, 5, in, ride, credit, ,, download, the...",CassiopeiaBQ
30370,30493,2021-01-01 00:43:02+00:00,"[Car, check, !, Make, it, a, habit, to, check,...",App_ADITT
30371,30494,2021-01-01 00:18:34+00:00,"[Requested, a, Lyft, ,, Uber, ,, or, Taxi, thr...",RideSafeWorld


In [19]:
lem = WordNetLemmatizer()

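# tqdm.pandas() registers progress_apply, so the bar tracks the lemmatization itself
tqdm.pandas()
df.Text = df.Text.progress_apply(lambda x: [lem.lemmatize(y) for y in x])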

100%|██████████| 30373/30373 [00:00<00:00, 1379524.78it/s]


In [20]:
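# tqdm.pandas() registers progress_apply, so the bar tracks the lemmatization itself
tqdm.pandas()
df_nohash.Text = df_nohash.Text.progress_apply(lambda x: [lem.lemmatize(y) for y in x])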

100%|██████████| 30373/30373 [00:00<00:00, 1960414.19it/s]


In [21]:
df_nohash

Unnamed: 0,index,Datetime,Text,Username
0,0,2021-07-27 23:23:08+00:00,"[Uber_Canada, uber, so, you, tell, me, I, have...",berthorny
1,1,2021-07-27 23:02:19+00:00,"[Life, in, prison, for, man, in, killing, of, ...",upstractcom
2,2,2021-07-27 23:00:52+00:00,"[“, Following, the, pandemic-led, lockdown, ,,...",badgerinstitute
3,3,2021-07-27 22:50:30+00:00,"[Great, security, at, iflymia, DHSgov, with, n...",RafaelAntun
4,4,2021-07-27 22:39:35+00:00,"[SkyNews, Can, govt, start, policing, company,...",peaceandprotect
...,...,...,...,...
30368,30491,2021-01-01 01:30:57+00:00,"[Please, don, ’, t, drink, and, drive, !, Thin...",jericka_w
30369,30492,2021-01-01 01:04:59+00:00,"[For, $, 5, in, ride, credit, ,, download, the...",CassiopeiaBQ
30370,30493,2021-01-01 00:43:02+00:00,"[Car, check, !, Make, it, a, habit, to, check,...",App_ADITT
30371,30494,2021-01-01 00:18:34+00:00,"[Requested, a, Lyft, ,, Uber, ,, or, Taxi, thr...",RideSafeWorld


In [22]:
df

Unnamed: 0,index,Datetime,Text,Username
0,0,2021-07-27 23:23:08+00:00,"[@, Uber_Canada, #, uber, so, you, tell, me, I...",berthorny
1,1,2021-07-27 23:02:19+00:00,"[Life, in, prison, for, man, in, killing, of, ...",upstractcom
2,2,2021-07-27 23:00:52+00:00,"[“, Following, the, pandemic-led, lockdown, ,,...",badgerinstitute
3,3,2021-07-27 22:50:30+00:00,"[Great, security, at, @, iflymia, @, DHSgov, w...",RafaelAntun
4,4,2021-07-27 22:39:35+00:00,"[@, SkyNews, Can, govt, start, policing, compa...",peaceandprotect
...,...,...,...,...
30368,30491,2021-01-01 01:30:57+00:00,"[Please, don, ’, t, drink, and, drive, !, Thin...",jericka_w
30369,30492,2021-01-01 01:04:59+00:00,"[For, $, 5, in, ride, credit, ,, download, the...",CassiopeiaBQ
30370,30493,2021-01-01 00:43:02+00:00,"[Car, check, !, Make, it, a, habit, to, check,...",App_ADITT
30371,30494,2021-01-01 00:18:34+00:00,"[Requested, a, #, Lyft, ,, #, Uber, ,, or, #, ...",RideSafeWorld


Not going to stem the data: 

* removing stop words seems like a risky move after analyzing some of the text 
* attempted to stem the text, but it suffered from over-stemming, and the fix for this would be to remove stop words and punctuation -- both things that I believe hold an important place in the text, especially since we are using Vader Sentiment for analysis. 
* not seeing much of a difference after lemmatization but I feel like this is something I can analyze after labeling the data.


Not seeing much benefit to keeping hashtags or @ symbols in the text after tokenizing the data, so I am going to keep moving with the dataframe that contains neither hashtags nor @ symbols. I want to keep other punctuation (!, ?, etc.) because Vader Sentiment will take this into account when determining sentiment. 
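
If the symbols ever need to be removed after tokenizing rather than before, the stray `#` and `@` tokens can be filtered out while leaving the rest of the punctuation alone. A minimal sketch on an invented token list; `drop_symbol_tokens` is a hypothetical helper name.

```python
# Tokens to discard; everything else (including ! and ?) is kept
SYMBOLS = {'#', '@'}


def drop_symbol_tokens(tokens):
    """Return the token list without standalone '#' and '@' tokens."""
    return [tok for tok in tokens if tok not in SYMBOLS]


# Invented token list shaped like the tokenized tweets above
tokens = ['@', 'Uber_Canada', '#', 'uber', 'so', 'late', '!', '?']
kept = drop_symbol_tokens(tokens)
```

On the tokenized dataframe this would be applied with `df.Text.apply(drop_symbol_tokens)`.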

In [23]:
df_nohash

Unnamed: 0,index,Datetime,Text,Username
0,0,2021-07-27 23:23:08+00:00,"[Uber_Canada, uber, so, you, tell, me, I, have...",berthorny
1,1,2021-07-27 23:02:19+00:00,"[Life, in, prison, for, man, in, killing, of, ...",upstractcom
2,2,2021-07-27 23:00:52+00:00,"[“, Following, the, pandemic-led, lockdown, ,,...",badgerinstitute
3,3,2021-07-27 22:50:30+00:00,"[Great, security, at, iflymia, DHSgov, with, n...",RafaelAntun
4,4,2021-07-27 22:39:35+00:00,"[SkyNews, Can, govt, start, policing, company,...",peaceandprotect
...,...,...,...,...
30368,30491,2021-01-01 01:30:57+00:00,"[Please, don, ’, t, drink, and, drive, !, Thin...",jericka_w
30369,30492,2021-01-01 01:04:59+00:00,"[For, $, 5, in, ride, credit, ,, download, the...",CassiopeiaBQ
30370,30493,2021-01-01 00:43:02+00:00,"[Car, check, !, Make, it, a, habit, to, check,...",App_ADITT
30371,30494,2021-01-01 00:18:34+00:00,"[Requested, a, Lyft, ,, Uber, ,, or, Taxi, thr...",RideSafeWorld


In [26]:
df_nohash.to_csv('/Users/jchap/Desktop/Cap3_Sentiment_Analysis_Project/Data/Cleaned_dataset_nosymb_data.csv')