# **Twitter Step 2: Parse and Clean Tweets**
By: Jon Chun
30 Nov 2020

* Parse tweets into components (e.g. hashtags, emojis, etc)
* Clean the main text of the tweets (e.g. lowercase, remove punct, etc)

Reference:

* https://www.freecodecamp.org/news/how-to-combine-multiple-csv-files-with-8-lines-of-code-265183e0854/

# **0. Setup Environment**

## You will need to give permission for this Colab to link to your gdrive in the code cell below

In [1]:
from google.colab import drive
drive.mount('/gdrive')
%cd /gdrive

Mounted at /gdrive
/gdrive


In [2]:
!pwd

/gdrive


In [3]:
# SET working directory

dir_working = './MyDrive/courses/2020f_iphs200_programming_humanity/code/twint/'

In [4]:
# CUSTOMIZE: if you want your work and twitter datasets saved into a specific folder
#            beneath your gdrive root directory, define it below

%cd $dir_working

/gdrive/MyDrive/courses/2020f_iphs200_programming_humanity/code/twint


In [5]:
!pwd

/gdrive/MyDrive/courses/2020f_iphs200_programming_humanity/code/twint


In [6]:
!ls -al *.csv

-rw------- 1 root root   986295 Dec  5 14:31 cleaned_tweets_combined_20201204-193123.csv
-rw------- 1 root root   782295 Dec  4 19:31 tweets_combined_20201204-193111.csv
-rw------- 1 root root   782295 Dec  4 19:31 tweets_combined_20201204-193123.csv
-rw------- 1 root root    40893 Dec  4 19:57 tweets_ner_sa__20201204-195723.csv
-rw------- 1 root root  5665660 Dec  8 16:42 tweets_twint_stopthesteal__20201208-164240.csv
-rw------- 1 root root 28209455 Dec  8 16:46 tweets_twint_stopthesteal__20201208-164638.csv
-rw------- 1 root root   790248 Dec  4 19:29 tweets_twint_tesla__20201204-192926.csv
-rw------- 1 root root   790257 Dec  5 14:18 tweets_twint_tesla__20201205-141801.csv


In [7]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

In [8]:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)  # None disables truncation; passing -1 is deprecated


In [9]:
import os
import re
import glob


In [10]:
!pip install contractions

Collecting contractions
  Downloading https://files.pythonhosted.org/packages/ce/ad/d1c685967945a04f8596128b15a1ab56c51488f53312e953341af6ff22d1/contractions-0.0.43-py2.py3-none-any.whl
Collecting textsearch
  Downloading https://files.pythonhosted.org/packages/42/a8/03407021f9555043de5492a2bd7a35c56cc03c2510092b5ec018cae1bbf1/textsearch-0.0.17-py2.py3-none-any.whl
Collecting pyahocorasick
  Downloading https://files.pythonhosted.org/packages/f4/9f/f0d8e8850e12829eea2e778f1c90e3c53a9a799b7f412082a5d21cd19ae1/pyahocorasick-1.4.0.tar.gz (312kB)
Collecting Unidecode
  Downloading https://files.pythonhosted.org/packages/d0/42/d9edfed04228bacea2d824904cae367ee9efd05e6cce7ceaaedd0b0ad964/Unidecode-1.1.1-py2.py3-none-any.whl (238kB)
Building wheels for collected packages: pyahocorasick
  Building wheel for pyahocorasick (setup.py) ... done
  

In [11]:
import contractions

# **1. Read in Combined Tweet Dataset File**

In [12]:
!ls -al *.csv

-rw------- 1 root root   986295 Dec  5 14:31 cleaned_tweets_combined_20201204-193123.csv
-rw------- 1 root root   782295 Dec  4 19:31 tweets_combined_20201204-193111.csv
-rw------- 1 root root   782295 Dec  4 19:31 tweets_combined_20201204-193123.csv
-rw------- 1 root root    40893 Dec  4 19:57 tweets_ner_sa__20201204-195723.csv
-rw------- 1 root root  5665660 Dec  8 16:42 tweets_twint_stopthesteal__20201208-164240.csv
-rw------- 1 root root 28209455 Dec  8 16:46 tweets_twint_stopthesteal__20201208-164638.csv
-rw------- 1 root root   790248 Dec  4 19:29 tweets_twint_tesla__20201204-192926.csv
-rw------- 1 root root   790257 Dec  5 14:18 tweets_twint_tesla__20201205-141801.csv


In [13]:
# CONFIGURE: Set the 'file_name_all' to the name of the combined datafile with all the tweets
#            which should be listed in the previous code cell

file_name_all = 'tweets_combined_20201204-193123.csv'
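
If you would rather not type the file name by hand, here is a minimal sketch of picking the most recent combined file automatically. It assumes the `tweets_combined_YYYYMMDD-HHMMSS.csv` naming seen in the listing above; the helper name `latest_combined` is hypothetical and not part of this notebook.

```python
from fnmatch import fnmatch

def latest_combined(file_names, pattern="tweets_combined_*.csv"):
    """Return the most recent file name matching `pattern`, or None if none match."""
    matches = [name for name in file_names if fnmatch(name, pattern)]
    # Timestamps written as YYYYMMDD-HHMMSS sort chronologically as plain strings
    return max(matches) if matches else None

names = [
    "cleaned_tweets_combined_20201204-193123.csv",
    "tweets_combined_20201204-193111.csv",
    "tweets_combined_20201204-193123.csv",
    "tweets_twint_tesla__20201205-141801.csv",
]
print(latest_combined(names))  # tweets_combined_20201204-193123.csv
```

In the notebook you could pass `glob.glob('*.csv')` as `file_names`, since `glob` is already imported above.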

In [14]:
combined_df = pd.read_csv(file_name_all, encoding='utf-8')
combined_df = combined_df.convert_dtypes()
combined_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1591 entries, 0 to 1590
Data columns (total 36 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   id               1591 non-null   Int64  
 1   conversation_id  1591 non-null   Int64  
 2   created_at       1591 non-null   string 
 3   date             1591 non-null   string 
 4   time             1591 non-null   string 
 5   timezone         1591 non-null   Int64  
 6   user_id          1591 non-null   Int64  
 7   username         1591 non-null   string 
 8   name             1591 non-null   string 
 9   place            0 non-null      Int64  
 10  tweet            1591 non-null   string 
 11  language         1591 non-null   string 
 12  mentions         1591 non-null   string 
 13  urls             1591 non-null   string 
 14  photos           1591 non-null   string 
 15  replies_count    1591 non-null   Int64  
 16  retweets_count   1591 non-null   Int64  
 17  likes_count   

In [15]:
combined_df.shape

(1591, 36)

In [16]:
combined_df.head()

Unnamed: 0,id,conversation_id,created_at,date,time,timezone,user_id,username,name,place,tweet,language,mentions,urls,photos,replies_count,retweets_count,likes_count,hashtags,cashtags,link,retweet,quote_url,video,thumbnail,near,geo,source,user_rt_id,user_rt,retweet_id,reply_to,retweet_date,translate,trans_src,trans_dest
0,1334127507906314240,1333882505775050752,2020-12-02 13:29:19 UTC,2020-12-02,13:29:19,0,44196397,elonmusk,Elon Musk,,"@Tesmanian_com Award accepted on behalf of the great people at Tesla, SpaceX, Neuralink &amp; Boring Co",en,[],[],[],313,284,8385,[],[],https://twitter.com/elonmusk/status/1334127507906314240,False,,0,,,,,,,,"[{'screen_name': 'Tesmanian_com', 'name': 'Tesmanian.com', 'id': '1100520274200416256'}]",,,,
1,1331075177262661633,1330843965613027329,2020-11-24 03:20:27 UTC,2020-11-24,03:20:27,0,44196397,elonmusk,Elon Musk,,"@PPathole @Teslarati @TeslaRoadTrip We’re still far from simply video in, control out. The biggest game-changer, currently underway at Tesla, is 360 degree, high fps video for labeling, training &amp; inference.",en,[],[],[],135,97,2542,[],[],https://twitter.com/elonmusk/status/1331075177262661633,False,,0,,,,,,,,"[{'screen_name': 'PPathole', 'name': 'Pranay Pathole', 'id': '1291945442'}, {'screen_name': 'Teslarati', 'name': 'TESLARATI', 'id': '1308211178'}, {'screen_name': 'TeslaRoadTrip', 'name': 'TeslaRoadTrip', 'id': '1182382878'}]",,,,
2,1330982572038500355,1330980211509186560,2020-11-23 21:12:28 UTC,2020-11-23,21:12:28,0,44196397,elonmusk,Elon Musk,,@vincent13031925 @Tesla Wow,und,[],[],[],790,367,19411,[],[],https://twitter.com/elonmusk/status/1330982572038500355,False,,0,,,,,,,,"[{'screen_name': 'vincent13031925', 'name': 'Vincent 🚀\U0001f7e0', 'id': '1689516060'}, {'screen_name': 'Tesla', 'name': 'Tesla', 'id': '13298072'}]",,,,
3,1330206139385044999,1330179916587884544,2020-11-21 17:47:12 UTC,2020-11-21,17:47:12,0,44196397,elonmusk,Elon Musk,,@heydave7 @philwhln Tesla is a vehicle for creating &amp; producing many useful products,en,[],[],[],103,106,1941,[],[],https://twitter.com/elonmusk/status/1330206139385044999,False,,0,,,,,,,,"[{'screen_name': 'heydave7', 'name': 'Dave Lee', 'id': '29893444'}, {'screen_name': 'philwhln', 'name': 'Phil Whelan 😷🇨🇦\U0001f995🚀', 'id': '13036132'}]",,,,
4,1330187635483140099,1330179916587884544,2020-11-21 16:33:40 UTC,2020-11-21,16:33:40,0,44196397,elonmusk,Elon Musk,,"@philwhln @heydave7 Because I am not an investor. Tesla is definitely not the only good company, but investing is not what I do. But I always put my own money into companies I help create, otherwise it’d be wrong to ask others to do so.",en,[],[],[],130,145,2495,[],[],https://twitter.com/elonmusk/status/1330187635483140099,False,,0,,,,,,,,"[{'screen_name': 'philwhln', 'name': 'Phil Whelan 😷🇨🇦\U0001f995🚀', 'id': '13036132'}, {'screen_name': 'heydave7', 'name': 'Dave Lee', 'id': '29893444'}]",,,,


# **2. Parse Tweets into Components**

Your class assignments this semester had you clean tweets manually to reinforce your understanding of Python, RegEx and NLP. For the final class project, I encourage you to use a text preprocessing library like 'preprocessor', illustrated below, so you can focus on analysis and interpretation.

Unfortunately, the 'preprocessor' library is relatively new and had no written documentation as of Nov 2020 (see: https://preprocessor.readthedocs.io/en/latest/). The code blocks below show the key functionality you may want to use, worked out by experimenting with the library and reading its source code.

References:

* https://github.com/s/preprocessor
* https://towardsdatascience.com/basic-tweet-preprocessing-in-python-efd8360d529e 

## Python Library to clean tweets: preprocessor

Cleans tweets, customizable filters 
* Input: string
* Output: string

Ref: https://github.com/s/preprocessor

```
p.set_options(p.OPT.URL, p.OPT.EMOJI)
p.clean('Preprocessor is #awesome 👍 https://github.com/s/preprocessor')
'Preprocessor is #awesome'
```

Options are:
```
Option Name       Option Short Code
URL               p.OPT.URL
Mention           p.OPT.MENTION
Hashtag           p.OPT.HASHTAG
Reserved Words    p.OPT.RESERVED
Emoji             p.OPT.EMOJI
Smiley            p.OPT.SMILEY
Number            p.OPT.NUMBER
```

The next few code blocks will show you how the library 'preprocessor' can clean, parse and tokenize tweets

* More info at: https://github.com/s/preprocessor
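
To make the option mechanism concrete, here is a standard-library sketch of selective filtering. It is not the library's implementation: the patterns are simplified assumptions, and `clean_selected` is a hypothetical name. It only mirrors the idea that listed options are removed and unlisted ones pass through.

```python
import re

# Simplified, hypothetical patterns; the real library's patterns are more thorough
PATTERNS = {
    "URL": re.compile(r"https?://\S+"),
    "MENTION": re.compile(r"@\w+"),
    "HASHTAG": re.compile(r"#\w+"),
    "NUMBER": re.compile(r"\b\d+(?:[.,]\d+)*\b"),
}

def clean_selected(text, *options):
    """Remove only the token types named in `options`; with no options, remove all."""
    selected = options or tuple(PATTERNS)
    for name in selected:
        text = PATTERNS[name].sub("", text)
    # collapse the gaps left behind by removed tokens
    return " ".join(text.split())

sample = "Preprocessor is #awesome https://github.com/s/preprocessor"
print(clean_selected(sample, "URL"))  # Preprocessor is #awesome
print(clean_selected(sample))         # Preprocessor is
```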

In [17]:
!pip install tweet-preprocessor

Collecting tweet-preprocessor
  Downloading https://files.pythonhosted.org/packages/17/9d/71bd016a9edcef8860c607e531f30bd09b13103c7951ae73dd2bf174163c/tweet_preprocessor-0.6.0-py3-none-any.whl
Installing collected packages: tweet-preprocessor
Successfully installed tweet-preprocessor-0.6.0


In [18]:
import preprocessor as p

In [19]:
# requires tweet-preprocessor (installed and imported above as 'p') to clean tweets
# https://towardsdatascience.com/twitter-sentiment-analysis-nlp-text-analytics-b7b296d71fce
# https://github.com/importdata/Twitter-Sentiment-Analysis/blob/master/Twitter_Sentiment_Analysis_Support_Vector_Classifier.ipynb

# punctuation to delete outright (raw strings avoid invalid-escape warnings)
REPLACE_NO_SPACE = re.compile(r"(\.)|(\;)|(\:)|(\!)|(\')|(\?)|(\,)|(\")|(\|)|(\()|(\))|(\[)|(\])|(\%)|(\$)|(\>)|(\<)|(\{)|(\})")
# markup and separators to replace with a single space
REPLACE_WITH_SPACE = re.compile(r"(<br\s/><br\s/?)|(-)|(/)|(:).")

def tweet_nopunct(astr):
  """Return astr lowercased and cleaned of tweet tokens (via p.clean), punctuation and markup tags"""
  astr_clean = p.clean(astr)
  # lowercase, then remove punctuation
  astr_clean = REPLACE_NO_SPACE.sub("", astr_clean.lower())
  astr_clean = REPLACE_WITH_SPACE.sub(" ", astr_clean)
  return astr_clean

In [20]:
# Test cleaning a tweet
 
# use 'p.set_options()' to filter out different types of tokens (e.g. URL, EMOJI, etc)
# if p.set_options() not called, clean will filter out everything but plain text
# if p.set_options() called, any p.OPT.x listed will be filtered out and unmentioned OPT will pass thru

# p.set_options(p.OPT.URL)
tweet_wpunct_str = p.clean('Preprocessor! is #awesome 👍 https://github.com/s/preprocessor')
print(tweet_wpunct_str)

Preprocessor! is


In [21]:
# Test cleaning a tweet with filters to remove tags and punctuation

tweet_wopunct_test = tweet_nopunct('Preprocessor! is #awesome 👍 https://github.com/s/preprocessor')
print(tweet_wopunct_test)

preprocessor is


In [22]:
def parseitem2list(api):
  """ Convert a sequence of preprocessor 'ParseItem' objects into a Python list of the matched strings """
  return [item.match for item in api]

In [23]:
def parse_tweet(tweet_str):
  """Parse the text of a tweet into sub-components and store in dict"""

  parsed_tweet_pi = p.parse(tweet_str)

  # each attribute is empty when nothing matched, so fall back to an empty list;
  # parseitem2list() is the helper defined in the previous cell
  tweet_dt = {}
  for component in ('urls', 'hashtags', 'mentions', 'emojis', 'smileys', 'numbers'):
    items = getattr(parsed_tweet_pi, component)
    tweet_dt[component] = parseitem2list(items) if items else []

  return tweet_dt

In [24]:
# Test
atweet = '@bigfoot :o http://bigfoot.ai says FAV Preprocessor ;/ is #awesome 👍 https://github.com/s/preprocessor RT if you like this #kickarse'

parse_tweet(atweet) # ['emojis']

{'emojis': ['👍'],
 'hashtags': ['#awesome', '#kickarse'],
 'mentions': ['@bigfoot'],
 'numbers': [],
 'smileys': [':o', ';/ '],
 'urls': ['http://bigfoot.ai', 'https://github.com/s/preprocessor']}
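
For comparison with the manual RegEx work from class, here is a dependency-free sketch of the same kind of extraction. The patterns are deliberately simplified assumptions (no emoji or smiley handling), and `parse_components` is a hypothetical name, not part of 'preprocessor'.

```python
import re

# Hypothetical, simplified patterns for illustration only
COMPONENT_PATTERNS = {
    "urls": r"https?://\S+",
    "hashtags": r"#\w+",
    "mentions": r"@\w+",
}

def parse_components(tweet):
    """Return a dict mapping each component name to the list of matches found."""
    return {name: re.findall(pattern, tweet)
            for name, pattern in COMPONENT_PATTERNS.items()}

tweet = "@bigfoot http://bigfoot.ai says Preprocessor is #awesome RT if you like this #kickarse"
print(parse_components(tweet))
```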

In [25]:
# Test
atweet = '@littlehand ;< http://littlehand.ai says FAV Preprocessor >:/ is #terrible 👍 https://github.com/s/preprocessor RT if you hate this #sucks'

parse_tweet(atweet)

{'emojis': ['👍'],
 'hashtags': ['#terrible', '#sucks'],
 'mentions': ['@littlehand'],
 'numbers': [],
 'smileys': [':/ '],
 'urls': ['http://littlehand.ai', 'https://github.com/s/preprocessor']}

In [26]:
combined_df.head()

Unnamed: 0,id,conversation_id,created_at,date,time,timezone,user_id,username,name,place,tweet,language,mentions,urls,photos,replies_count,retweets_count,likes_count,hashtags,cashtags,link,retweet,quote_url,video,thumbnail,near,geo,source,user_rt_id,user_rt,retweet_id,reply_to,retweet_date,translate,trans_src,trans_dest
0,1334127507906314240,1333882505775050752,2020-12-02 13:29:19 UTC,2020-12-02,13:29:19,0,44196397,elonmusk,Elon Musk,,"@Tesmanian_com Award accepted on behalf of the great people at Tesla, SpaceX, Neuralink &amp; Boring Co",en,[],[],[],313,284,8385,[],[],https://twitter.com/elonmusk/status/1334127507906314240,False,,0,,,,,,,,"[{'screen_name': 'Tesmanian_com', 'name': 'Tesmanian.com', 'id': '1100520274200416256'}]",,,,
1,1331075177262661633,1330843965613027329,2020-11-24 03:20:27 UTC,2020-11-24,03:20:27,0,44196397,elonmusk,Elon Musk,,"@PPathole @Teslarati @TeslaRoadTrip We’re still far from simply video in, control out. The biggest game-changer, currently underway at Tesla, is 360 degree, high fps video for labeling, training &amp; inference.",en,[],[],[],135,97,2542,[],[],https://twitter.com/elonmusk/status/1331075177262661633,False,,0,,,,,,,,"[{'screen_name': 'PPathole', 'name': 'Pranay Pathole', 'id': '1291945442'}, {'screen_name': 'Teslarati', 'name': 'TESLARATI', 'id': '1308211178'}, {'screen_name': 'TeslaRoadTrip', 'name': 'TeslaRoadTrip', 'id': '1182382878'}]",,,,
2,1330982572038500355,1330980211509186560,2020-11-23 21:12:28 UTC,2020-11-23,21:12:28,0,44196397,elonmusk,Elon Musk,,@vincent13031925 @Tesla Wow,und,[],[],[],790,367,19411,[],[],https://twitter.com/elonmusk/status/1330982572038500355,False,,0,,,,,,,,"[{'screen_name': 'vincent13031925', 'name': 'Vincent 🚀\U0001f7e0', 'id': '1689516060'}, {'screen_name': 'Tesla', 'name': 'Tesla', 'id': '13298072'}]",,,,
3,1330206139385044999,1330179916587884544,2020-11-21 17:47:12 UTC,2020-11-21,17:47:12,0,44196397,elonmusk,Elon Musk,,@heydave7 @philwhln Tesla is a vehicle for creating &amp; producing many useful products,en,[],[],[],103,106,1941,[],[],https://twitter.com/elonmusk/status/1330206139385044999,False,,0,,,,,,,,"[{'screen_name': 'heydave7', 'name': 'Dave Lee', 'id': '29893444'}, {'screen_name': 'philwhln', 'name': 'Phil Whelan 😷🇨🇦\U0001f995🚀', 'id': '13036132'}]",,,,
4,1330187635483140099,1330179916587884544,2020-11-21 16:33:40 UTC,2020-11-21,16:33:40,0,44196397,elonmusk,Elon Musk,,"@philwhln @heydave7 Because I am not an investor. Tesla is definitely not the only good company, but investing is not what I do. But I always put my own money into companies I help create, otherwise it’d be wrong to ask others to do so.",en,[],[],[],130,145,2495,[],[],https://twitter.com/elonmusk/status/1330187635483140099,False,,0,,,,,,,,"[{'screen_name': 'philwhln', 'name': 'Phil Whelan 😷🇨🇦\U0001f995🚀', 'id': '13036132'}, {'screen_name': 'heydave7', 'name': 'Dave Lee', 'id': '29893444'}]",,,,


In [27]:
combined_df.iloc[-100:]['tweet']

1491    Tesla Supercharger network now energized from New York to LA, both coast + Texas! Approx 80% of US population covered.                         
1492    Tesla policy is to charge the same price (+ taxes &amp; shipping) everywhere in the world  http://t.co/cXOtokcBeG                              
1493    German govt reviews Tesla Model S fires. All due to high speed impacts, no injuries. Concludes: no defects, no recall  http://t.co/24iZzOSL3B  
1494    Tesla Model S Consumer Reports customer satisfaction survey highest of any car on road at 99/100  http://t.co/PpuS01S2KN                       
1495    Tesla is also extending the Model S warranty to cover any fire damage even if due solely to a driver accident                                  
1496    Why does a Tesla fire w no injury get more media headlines than 100,000 gas car fires that kill 100s of people per year?                       
1497    Mission of Tesla   http://t.co/UchbT5NZE3                                       

In [28]:
parse_tweet(combined_df.iloc[1015]['tweet'])

{'emojis': ['🐻', '🚘'],
 'hashtags': [],
 'mentions': ['@incentives101', '@Jason', '@Tesla'],
 'numbers': [],
 'smileys': [],
 'urls': []}

In [29]:
parse_tweet(combined_df.iloc[1500]['tweet'])['hashtags']

['#KatieWoodencloak']

In [30]:
# Split out identifiable tweet subcomponents into separate columns before cleaning them further below

combined_df['hashtags'] = combined_df['tweet'].apply(lambda astr : parse_tweet(astr)['hashtags'])
combined_df['mentions'] = combined_df['tweet'].apply(lambda astr : parse_tweet(astr)['mentions'])
combined_df['urls'] = combined_df['tweet'].apply(lambda astr : parse_tweet(astr)['urls'])


In [31]:
# Split out identifiable tweet subcomponents into separate columns before cleaning them further below

combined_df['emojis'] = combined_df['tweet'].apply(lambda astr : parse_tweet(astr)['emojis'])
combined_df['smileys'] = combined_df['tweet'].apply(lambda astr : parse_tweet(astr)['smileys'])
combined_df['numbers'] = combined_df['tweet'].apply(lambda astr : parse_tweet(astr)['numbers'])
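
The two cells above call `parse_tweet` six times per tweet, once for each column. On larger datasets it may be worth parsing each tweet once and then splitting the results. The sketch below shows that idea with plain Python; `parse_stub` is a hypothetical stand-in for `parse_tweet`, and `parse_once` is likewise an illustrative name.

```python
def parse_stub(tweet):
    """Hypothetical stand-in for parse_tweet: returns a dict of component lists."""
    words = tweet.split()
    return {
        "hashtags": [w for w in words if w.startswith("#")],
        "mentions": [w for w in words if w.startswith("@")],
    }

def parse_once(tweets, parser):
    """Call `parser` once per tweet, then pivot the dicts into one list per component."""
    parsed = [parser(t) for t in tweets]
    keys = parsed[0].keys() if parsed else []
    return {key: [row[key] for row in parsed] for key in keys}

columns = parse_once(["@Tesla Wow", "Mission of Tesla #ev"], parse_stub)
print(columns["mentions"])  # [['@Tesla'], []]
print(columns["hashtags"])  # [[], ['#ev']]
```

Each resulting list has one entry per tweet, so it could be assigned to a DataFrame column of the same length.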

In [32]:
combined_df['tweet'][1010:1015]

1010    @FredericLambert No such thing as a “full refresh” at Tesla or even a model year. Our cars are partially upgraded every month as soon as a new subsystem is ready for production. There is no cadence.
1011    The physics of how Tesla achieved best safety of any cars ever tested. Note, when vehicle weight is taken into account, order is more like X,S, then 3, but they are all very close.                  
1012    Tesla owner shows how well ultrawhite seats hold up after 25,000 miles. The black &amp; white interior is def best imo.  https://t.co/vWQ8X8JHYF                                                      
1013    @AsbjornLD @martinengwicht @Jason @Tesla Two — one for 👽 👾 millions of years from now and one for you                                                                                                 
1014    @martinengwicht @Jason @Tesla Yes                                                                                                                                   

In [33]:
combined_df[1010:1015]

Unnamed: 0,id,conversation_id,created_at,date,time,timezone,user_id,username,name,place,tweet,language,mentions,urls,photos,replies_count,retweets_count,likes_count,hashtags,cashtags,link,retweet,quote_url,video,thumbnail,near,geo,source,user_rt_id,user_rt,retweet_id,reply_to,retweet_date,translate,trans_src,trans_dest,emojis,smileys,numbers
1010,1049716203676037122,1049707571102306304,2018-10-09 17:40:20 UTC,2018-10-09,17:40:20,0,44196397,elonmusk,Elon Musk,,@FredericLambert No such thing as a “full refresh” at Tesla or even a model year. Our cars are partially upgraded every month as soon as a new subsystem is ready for production. There is no cadence.,en,[@FredericLambert],[],[],112,129,1229,[],[],https://twitter.com/elonmusk/status/1049716203676037122,False,,0,,,,,,,,"[{'screen_name': 'FredericLambert', 'name': 'Fred Lambert', 'id': '38253449'}]",,,,,[],[],[]
1011,1049324111367815169,1049324111367815169,2018-10-08 15:42:18 UTC,2018-10-08,15:42:18,0,44196397,elonmusk,Elon Musk,,"The physics of how Tesla achieved best safety of any cars ever tested. Note, when vehicle weight is taken into account, order is more like X,S, then 3, but they are all very close.",en,[],[],[],906,3047,27124,[],[],https://twitter.com/elonmusk/status/1049324111367815169,False,https://twitter.com/Tesla/status/1049284924321087488,0,,,,,,,,[],,,,,[],[],[ 3]
1012,1049045561792294912,1049045561792294912,2018-10-07 21:15:26 UTC,2018-10-07,21:15:26,0,44196397,elonmusk,Elon Musk,,"Tesla owner shows how well ultrawhite seats hold up after 25,000 miles. The black &amp; white interior is def best imo. https://t.co/vWQ8X8JHYF",en,[],[https://t.co/vWQ8X8JHYF],[],460,602,10058,[],[],https://twitter.com/elonmusk/status/1049045561792294912,False,,0,,,,,,,,[],,,,,[],[],"[ 25,000]"
1013,1048692170377506817,1048466642194259968,2018-10-06 21:51:11 UTC,2018-10-06,21:51:11,0,44196397,elonmusk,Elon Musk,,@AsbjornLD @martinengwicht @Jason @Tesla Two — one for 👽 👾 millions of years from now and one for you,en,"[@AsbjornLD, @martinengwicht, @Jason, @Tesla]",[],[],27,30,617,[],[],https://twitter.com/elonmusk/status/1048692170377506817,False,,0,,,,,,,,"[{'screen_name': 'MartinEngwicht', 'name': 'Starship Trooper', 'id': '1252353732672716800'}, {'screen_name': 'Jason', 'name': 'jason@calacanis.com', 'id': '3840'}, {'screen_name': 'Tesla', 'name': 'Tesla', 'id': '13298072'}]",,,,,"[👽, 👾]",[],[]
1014,1048690159875588096,1048466642194259968,2018-10-06 21:43:12 UTC,2018-10-06,21:43:12,0,44196397,elonmusk,Elon Musk,,@martinengwicht @Jason @Tesla Yes,und,"[@martinengwicht, @Jason, @Tesla]",[],[],14,15,578,[],[],https://twitter.com/elonmusk/status/1048690159875588096,False,,0,,,,,,,,"[{'screen_name': 'MartinEngwicht', 'name': 'Starship Trooper', 'id': '1252353732672716800'}, {'screen_name': 'Jason', 'name': 'jason@calacanis.com', 'id': '3840'}, {'screen_name': 'Tesla', 'name': 'Tesla', 'id': '13298072'}]",,,,,[],[],[]


In [34]:
# Convert columns to more specific dtype

combined_df = combined_df.convert_dtypes() # astype({'tweet':'str'})
combined_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1591 entries, 0 to 1590
Data columns (total 39 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   id               1591 non-null   Int64  
 1   conversation_id  1591 non-null   Int64  
 2   created_at       1591 non-null   string 
 3   date             1591 non-null   string 
 4   time             1591 non-null   string 
 5   timezone         1591 non-null   Int64  
 6   user_id          1591 non-null   Int64  
 7   username         1591 non-null   string 
 8   name             1591 non-null   string 
 9   place            0 non-null      Int64  
 10  tweet            1591 non-null   string 
 11  language         1591 non-null   string 
 12  mentions         1591 non-null   object 
 13  urls             1591 non-null   object 
 14  photos           1591 non-null   string 
 15  replies_count    1591 non-null   Int64  
 16  retweets_count   1591 non-null   Int64  
 17  likes_count   

In [35]:
# test

pd.to_datetime('2020-11-08')

Timestamp('2020-11-08 00:00:00')

In [36]:
# Convert created_at from string to a timezone-aware datetime type
# (errors='ignore' and infer_datetime_format are deprecated in recent pandas releases)
combined_df['created_at'] = pd.to_datetime(combined_df['created_at'], errors='coerce', utc=True)
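
To see what this conversion does to a single value, here is a standard-library sketch for one `created_at` string. It assumes the `YYYY-MM-DD HH:MM:SS UTC` layout shown in the table output earlier and treats every timestamp as UTC; `parse_created_at` is a hypothetical helper name.

```python
from datetime import datetime, timezone

def parse_created_at(stamp):
    """Parse 'YYYY-MM-DD HH:MM:SS UTC' into a timezone-aware datetime (UTC assumed)."""
    text = stamp.strip()
    if text.endswith(" UTC"):
        text = text[: -len(" UTC")]
    naive = datetime.strptime(text, "%Y-%m-%d %H:%M:%S")
    return naive.replace(tzinfo=timezone.utc)

print(parse_created_at("2020-12-02 13:29:19 UTC").isoformat())  # 2020-12-02T13:29:19+00:00
```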

In [37]:
# df['Date']= pd.to_datetime(df['Date'])
# combined_df['tweet_dt'] = pd.to_datetime(combined_df['tweet_dt'], errors='coerce')

In [38]:
combined_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1591 entries, 0 to 1590
Data columns (total 39 columns):
 #   Column           Non-Null Count  Dtype              
---  ------           --------------  -----              
 0   id               1591 non-null   Int64              
 1   conversation_id  1591 non-null   Int64              
 2   created_at       1591 non-null   datetime64[ns, UTC]
 3   date             1591 non-null   string             
 4   time             1591 non-null   string             
 5   timezone         1591 non-null   Int64              
 6   user_id          1591 non-null   Int64              
 7   username         1591 non-null   string             
 8   name             1591 non-null   string             
 9   place            0 non-null      Int64              
 10  tweet            1591 non-null   string             
 11  language         1591 non-null   string             
 12  mentions         1591 non-null   object             
 13  urls             1

In [39]:
# Show any rows in which every column is null (an empty result means there are none)
print(combined_df[combined_df.isna().all(axis=1)])

Empty DataFrame
Columns: [id, conversation_id, created_at, date, time, timezone, user_id, username, name, place, tweet, language, mentions, urls, photos, replies_count, retweets_count, likes_count, hashtags, cashtags, link, retweet, quote_url, video, thumbnail, near, geo, source, user_rt_id, user_rt, retweet_id, reply_to, retweet_date, translate, trans_src, trans_dest, emojis, smileys, numbers]
Index: []


In [40]:
# drop any completely null tweets

combined_df.dropna(how='all', axis=0, inplace=True)

In [41]:
combined_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1591 entries, 0 to 1590
Data columns (total 39 columns):
 #   Column           Non-Null Count  Dtype              
---  ------           --------------  -----              
 0   id               1591 non-null   Int64              
 1   conversation_id  1591 non-null   Int64              
 2   created_at       1591 non-null   datetime64[ns, UTC]
 3   date             1591 non-null   string             
 4   time             1591 non-null   string             
 5   timezone         1591 non-null   Int64              
 6   user_id          1591 non-null   Int64              
 7   username         1591 non-null   string             
 8   name             1591 non-null   string             
 9   place            0 non-null      Int64              
 10  tweet            1591 non-null   string             
 11  language         1591 non-null   string             
 12  mentions         1591 non-null   object             
 13  urls             1

In [42]:
# list the rows that contain at least one null value

# print(combined_df[combined_df.isna().any(axis=1)])
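
The "all" and "any" checks answer different questions, which matters here because the `place` column has 0 non-null values: an any-null filter would match every row. A small plain-Python sketch of the difference, using made-up rows and `None` as the null marker:

```python
# Made-up rows for illustration; None plays the role of a null value
rows = [
    {"id": 1, "place": None, "tweet": "Wow"},
    {"id": None, "place": None, "tweet": None},
    {"id": 3, "place": "Austin", "tweet": "Yes"},
]

def is_all_null(row):
    """True only when every value in the row is null."""
    return all(value is None for value in row.values())

def has_any_null(row):
    """True when at least one value in the row is null."""
    return any(value is None for value in row.values())

all_null = [i for i, row in enumerate(rows) if is_all_null(row)]
any_null = [i for i, row in enumerate(rows) if has_any_null(row)]
print(all_null)  # [1]     rows that dropna(how='all') would remove
print(any_null)  # [0, 1]  rows with at least one missing value
```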

In [43]:
combined_df.head()

Unnamed: 0,id,conversation_id,created_at,date,time,timezone,user_id,username,name,place,tweet,language,mentions,urls,photos,replies_count,retweets_count,likes_count,hashtags,cashtags,link,retweet,quote_url,video,thumbnail,near,geo,source,user_rt_id,user_rt,retweet_id,reply_to,retweet_date,translate,trans_src,trans_dest,emojis,smileys,numbers
0,1334127507906314240,1333882505775050752,2020-12-02 13:29:19+00:00,2020-12-02,13:29:19,0,44196397,elonmusk,Elon Musk,,"@Tesmanian_com Award accepted on behalf of the great people at Tesla, SpaceX, Neuralink &amp; Boring Co",en,[@Tesmanian_com],[],[],313,284,8385,[],[],https://twitter.com/elonmusk/status/1334127507906314240,False,,0,,,,,,,,"[{'screen_name': 'Tesmanian_com', 'name': 'Tesmanian.com', 'id': '1100520274200416256'}]",,,,,[],[],[]
1,1331075177262661633,1330843965613027329,2020-11-24 03:20:27+00:00,2020-11-24,03:20:27,0,44196397,elonmusk,Elon Musk,,"@PPathole @Teslarati @TeslaRoadTrip We’re still far from simply video in, control out. The biggest game-changer, currently underway at Tesla, is 360 degree, high fps video for labeling, training &amp; inference.",en,"[@PPathole, @Teslarati, @TeslaRoadTrip]",[],[],135,97,2542,[],[],https://twitter.com/elonmusk/status/1331075177262661633,False,,0,,,,,,,,"[{'screen_name': 'PPathole', 'name': 'Pranay Pathole', 'id': '1291945442'}, {'screen_name': 'Teslarati', 'name': 'TESLARATI', 'id': '1308211178'}, {'screen_name': 'TeslaRoadTrip', 'name': 'TeslaRoadTrip', 'id': '1182382878'}]",,,,,[],[],[ 360]
2,1330982572038500355,1330980211509186560,2020-11-23 21:12:28+00:00,2020-11-23,21:12:28,0,44196397,elonmusk,Elon Musk,,@vincent13031925 @Tesla Wow,und,"[@vincent13031925, @Tesla]",[],[],790,367,19411,[],[],https://twitter.com/elonmusk/status/1330982572038500355,False,,0,,,,,,,,"[{'screen_name': 'vincent13031925', 'name': 'Vincent 🚀\U0001f7e0', 'id': '1689516060'}, {'screen_name': 'Tesla', 'name': 'Tesla', 'id': '13298072'}]",,,,,[],[],[]
3,1330206139385044999,1330179916587884544,2020-11-21 17:47:12+00:00,2020-11-21,17:47:12,0,44196397,elonmusk,Elon Musk,,@heydave7 @philwhln Tesla is a vehicle for creating &amp; producing many useful products,en,"[@heydave7, @philwhln]",[],[],103,106,1941,[],[],https://twitter.com/elonmusk/status/1330206139385044999,False,,0,,,,,,,,"[{'screen_name': 'heydave7', 'name': 'Dave Lee', 'id': '29893444'}, {'screen_name': 'philwhln', 'name': 'Phil Whelan 😷🇨🇦\U0001f995🚀', 'id': '13036132'}]",,,,,[],[],[]
4,1330187635483140099,1330179916587884544,2020-11-21 16:33:40+00:00,2020-11-21,16:33:40,0,44196397,elonmusk,Elon Musk,,"@philwhln @heydave7 Because I am not an investor. Tesla is definitely not the only good company, but investing is not what I do. But I always put my own money into companies I help create, otherwise it’d be wrong to ask others to do so.",en,"[@philwhln, @heydave7]",[],[],130,145,2495,[],[],https://twitter.com/elonmusk/status/1330187635483140099,False,,0,,,,,,,,"[{'screen_name': 'philwhln', 'name': 'Phil Whelan 😷🇨🇦\U0001f995🚀', 'id': '13036132'}, {'screen_name': 'heydave7', 'name': 'Dave Lee', 'id': '29893444'}]",,,,,[],[],[]


In [44]:
# Create clean tweet from original tweet text

# expand contractions first, while the apostrophes are still present
combined_df['tweet_clean'] = combined_df['tweet'].apply(contractions.fix)

# strip non-text tokens such as mentions, URLs and numbers (via p.clean), lowercase, and remove punctuation
# note: HTML entities are not decoded here, so '&amp;' survives as '&amp'
combined_df['tweet_clean'] = combined_df['tweet_clean'].apply(tweet_nopunct)

# collapse multiple-whitespaces to one whitespace
combined_df['tweet_clean'] = combined_df['tweet_clean'].apply(lambda astr : ' '.join(astr.split()))

# all_my_tweets_df['tweet_clean'] = all_my_tweets_df['tweet'].apply(lambda astr : p.clean(astr))
# ERR: all_my_tweets_df['tweet_clean'] = all_my_tweets_df['tweet'].str.apply(p.clean(astr)
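
The order of these steps matters: if punctuation were removed before contractions are expanded, "we're" would collapse into "were". The sketch below demonstrates this with a tiny hypothetical contraction map and a simplified punctuation pattern; it is an illustration, not the `contractions` package or the `tweet_nopunct` function.

```python
import re

# Tiny, hypothetical contraction map; the `contractions` package covers far more
CONTRACTIONS = {"we're": "we are", "it'd": "it would", "don't": "do not"}
PUNCT = re.compile(r"[.;:!'?,\"|()\[\]%$<>{}]")

def clean(text, expand_first=True):
    """Lowercase, optionally expand contractions, strip punctuation, collapse spaces."""
    text = text.lower().replace("\u2019", "'")  # normalize curly apostrophes
    if expand_first:
        for short, full in CONTRACTIONS.items():
            text = text.replace(short, full)
    text = PUNCT.sub("", text)
    return " ".join(text.split())

print(clean("We\u2019re   still far!"))                      # we are still far
print(clean("We\u2019re   still far!", expand_first=False))  # were still far
```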

In [45]:
combined_df['tweet_clean'][:40]

0     award accepted on behalf of the great people at tesla spacex neuralink &amp boring co
1     we are still far from simply video in control out the biggest game changer currently underway at tesla is degree high fps video for labeling training &amp inference
2     wow
3     tesla is a vehicle for creating &amp producing many useful products
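
The cleaned output above still contains `&amp` fragments: the HTML entity `&amp;` loses only its semicolon when punctuation is stripped. Below is a minimal sketch of one way to avoid this, assuming it is acceptable to decode HTML entities before removing punctuation. `clean_text_sketch` is a hypothetical helper written for illustration; it is not the notebook's `tweet_nopunct`, and it uses only the Python standard library.

```python
import html
import re
import string

# Translation table that maps every ASCII punctuation character to a space
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

def clean_text_sketch(text: str) -> str:
    """Hypothetical cleaner: decode entities, lowercase, drop mentions and punctuation."""
    text = html.unescape(text)              # '&amp;' becomes '&'
    text = text.lower()
    text = re.sub(r'@\w+', ' ', text)       # drop @mentions
    text = text.translate(_PUNCT_TO_SPACE)  # ASCII punctuation (including '&') becomes a space
    return ' '.join(text.split())           # collapse runs of whitespace

print(clean_text_sketch("@heydave7 @philwhln Tesla is a vehicle for creating &amp; producing many useful products"))
```

Note that `string.punctuation` covers ASCII punctuation only, so typographic characters such as the curly apostrophe would need separate handling.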

In [46]:
combined_df.head()

Unnamed: 0,id,conversation_id,created_at,date,time,timezone,user_id,username,name,place,tweet,language,mentions,urls,photos,replies_count,retweets_count,likes_count,hashtags,cashtags,link,retweet,quote_url,video,thumbnail,near,geo,source,user_rt_id,user_rt,retweet_id,reply_to,retweet_date,translate,trans_src,trans_dest,emojis,smileys,numbers,tweet_clean
0,1334127507906314240,1333882505775050752,2020-12-02 13:29:19+00:00,2020-12-02,13:29:19,0,44196397,elonmusk,Elon Musk,,"@Tesmanian_com Award accepted on behalf of the great people at Tesla, SpaceX, Neuralink &amp; Boring Co",en,[@Tesmanian_com],[],[],313,284,8385,[],[],https://twitter.com/elonmusk/status/1334127507906314240,False,,0,,,,,,,,"[{'screen_name': 'Tesmanian_com', 'name': 'Tesmanian.com', 'id': '1100520274200416256'}]",,,,,[],[],[],award accepted on behalf of the great people at tesla spacex neuralink &amp boring co
1,1331075177262661633,1330843965613027329,2020-11-24 03:20:27+00:00,2020-11-24,03:20:27,0,44196397,elonmusk,Elon Musk,,"@PPathole @Teslarati @TeslaRoadTrip We’re still far from simply video in, control out. The biggest game-changer, currently underway at Tesla, is 360 degree, high fps video for labeling, training &amp; inference.",en,"[@PPathole, @Teslarati, @TeslaRoadTrip]",[],[],135,97,2542,[],[],https://twitter.com/elonmusk/status/1331075177262661633,False,,0,,,,,,,,"[{'screen_name': 'PPathole', 'name': 'Pranay Pathole', 'id': '1291945442'}, {'screen_name': 'Teslarati', 'name': 'TESLARATI', 'id': '1308211178'}, {'screen_name': 'TeslaRoadTrip', 'name': 'TeslaRoadTrip', 'id': '1182382878'}]",,,,,[],[],[ 360],we are still far from simply video in control out the biggest game changer currently underway at tesla is degree high fps video for labeling training &amp inference
2,1330982572038500355,1330980211509186560,2020-11-23 21:12:28+00:00,2020-11-23,21:12:28,0,44196397,elonmusk,Elon Musk,,@vincent13031925 @Tesla Wow,und,"[@vincent13031925, @Tesla]",[],[],790,367,19411,[],[],https://twitter.com/elonmusk/status/1330982572038500355,False,,0,,,,,,,,"[{'screen_name': 'vincent13031925', 'name': 'Vincent 🚀\U0001f7e0', 'id': '1689516060'}, {'screen_name': 'Tesla', 'name': 'Tesla', 'id': '13298072'}]",,,,,[],[],[],wow
3,1330206139385044999,1330179916587884544,2020-11-21 17:47:12+00:00,2020-11-21,17:47:12,0,44196397,elonmusk,Elon Musk,,@heydave7 @philwhln Tesla is a vehicle for creating &amp; producing many useful products,en,"[@heydave7, @philwhln]",[],[],103,106,1941,[],[],https://twitter.com/elonmusk/status/1330206139385044999,False,,0,,,,,,,,"[{'screen_name': 'heydave7', 'name': 'Dave Lee', 'id': '29893444'}, {'screen_name': 'philwhln', 'name': 'Phil Whelan 😷🇨🇦\U0001f995🚀', 'id': '13036132'}]",,,,,[],[],[],tesla is a vehicle for creating &amp producing many useful products
4,1330187635483140099,1330179916587884544,2020-11-21 16:33:40+00:00,2020-11-21,16:33:40,0,44196397,elonmusk,Elon Musk,,"@philwhln @heydave7 Because I am not an investor. Tesla is definitely not the only good company, but investing is not what I do. But I always put my own money into companies I help create, otherwise it’d be wrong to ask others to do so.",en,"[@philwhln, @heydave7]",[],[],130,145,2495,[],[],https://twitter.com/elonmusk/status/1330187635483140099,False,,0,,,,,,,,"[{'screen_name': 'philwhln', 'name': 'Phil Whelan 😷🇨🇦\U0001f995🚀', 'id': '13036132'}, {'screen_name': 'heydave7', 'name': 'Dave Lee', 'id': '29893444'}]",,,,,[],[],[],because i am not an investor tesla is definitely not the only good company but investing is not what i do but i always put my own money into companies i help create otherwise it would be wrong to ask others to do so


# **3. Write Parsed & Cleaned Tweets to datafile.csv**

In [47]:
# Derive the output filename from the combined input filename (file_name_all)

file_name_cleaned = ('cleaned_' + file_name_all.split('.')[0] + '.csv') # e.g. 'cleaned_tweets_combined_20201201-012404.csv'
print(file_name_cleaned)

combined_df.to_csv(file_name_cleaned)

cleaned_tweets_combined_20201204-193123.csv
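
The `split('.')[0]` approach above keeps only the text before the first dot, so it would truncate a filename that contains extra dots. A hedged alternative using `pathlib` is sketched below; `cleaned_name` is a hypothetical helper, not part of the original notebook.

```python
from pathlib import Path

def cleaned_name(file_name_all: str) -> str:
    """Hypothetical helper: prefix the input file's stem with 'cleaned_' and use a .csv suffix."""
    # Path.stem strips only the final extension, so earlier dots are preserved
    return f"cleaned_{Path(file_name_all).stem}.csv"

print(cleaned_name('tweets_combined_20201204-193123.csv'))
```

For the filename used in this notebook, both approaches give the same result.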


In [48]:
!ls -al cleaned_*

-rw------- 1 root root 986297 Dec  8 23:33 cleaned_tweets_combined_20201204-193123.csv


In [49]:
!head -5 $file_name_cleaned

,id,conversation_id,created_at,date,time,timezone,user_id,username,name,place,tweet,language,mentions,urls,photos,replies_count,retweets_count,likes_count,hashtags,cashtags,link,retweet,quote_url,video,thumbnail,near,geo,source,user_rt_id,user_rt,retweet_id,reply_to,retweet_date,translate,trans_src,trans_dest,emojis,smileys,numbers,tweet_clean
0,1334127507906314240,1333882505775050752,2020-12-02 13:29:19+00:00,2020-12-02,13:29:19,0,44196397,elonmusk,Elon Musk,,"@Tesmanian_com Award accepted on behalf of the great people at Tesla, SpaceX, Neuralink &amp; Boring Co",en,['@Tesmanian_com'],[],[],313,284,8385,[],[],https://twitter.com/elonmusk/status/1334127507906314240,False,,0,,,,,,,,"[{'screen_name': 'Tesmanian_com', 'name': 'Tesmanian.com', 'id': '1100520274200416256'}]",,,,,[],[],[],award accepted on behalf of the great people at tesla spacex neuralink &amp boring co
1,1331075177262661633,1330843965613027329,2020-11-24 03:20:27+00:00,2020-11-24,03:20:27,0,44196397,elonmusk,Elon Musk,,"

# **4. Write Plain Text of Tweets for Traditional Text Analytics**

In [50]:
import time

In [51]:
# Create a filename for the plaintext tweets that contains a unique timestamp
timestr = time.strftime("%Y%m%d-%H%M%S")
file_name_plaintext = f"tweets_plaintext_{timestr}.txt"

print(file_name_plaintext)

tweets_plaintext_20201208-233325.txt


In [52]:
# Write all tweets combined into one big file

combined_df.to_csv(file_name_plaintext, columns=['tweet_clean'], index=False, header=False, encoding='utf-8')

In [53]:
# Write a copy of the plaintext tweets with the lines in reversed order
#   so the oldest tweets come first. The reversed file is the one you
#   should use for time-series visualization in tools like voyant-tools.org,
#   which assume the oldest tweets/datapoints begin on the first line
#   of the file and plot them left/oldest to right/newest on the x-axis.

# Create a filename for the reversed file that contains a unique timestamp
timestr = time.strftime("%Y%m%d-%H%M%S")
file_name_plaintext_rev = f"tweets_plaintext_rev_{timestr}.txt"

print(file_name_plaintext_rev)

tweets_plaintext_rev_20201208-233326.txt
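
Because the timestamp is generated twice, the regular and reversed filenames above differ by one second (`233325` vs `233326`). If matching names are preferred, one option is to compute the timestamp once and reuse it. This is a sketch only; `plaintext_names` and its `prefix` parameter are hypothetical.

```python
import time

def plaintext_names(prefix: str = "tweets_plaintext"):
    """Hypothetical helper: build both filenames from a single shared timestamp."""
    timestr = time.strftime("%Y%m%d-%H%M%S")
    return f"{prefix}_{timestr}.txt", f"{prefix}_rev_{timestr}.txt"

name_fwd, name_rev = plaintext_names()
print(name_fwd)
print(name_rev)
```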


In [63]:
# Create a 'reversed' order of the plaintext file using 
#   simple UNIX command line utilities

!tac $file_name_plaintext > $file_name_plaintext_rev
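
The `tac` utility is not available on every system (for example, it is not a standard part of macOS or Windows). The same reversal can be sketched in plain Python, assuming the file is small enough to read into memory; `reverse_lines` is a hypothetical helper, not part of the original notebook.

```python
def reverse_lines(src_path: str, dst_path: str, encoding: str = "utf-8") -> int:
    """Hypothetical helper: write the lines of src_path to dst_path in reverse order."""
    with open(src_path, encoding=encoding) as src:
        lines = src.readlines()
    # Make sure the last line ends with a newline so it does not merge
    # with the line that follows it after reversal
    if lines and not lines[-1].endswith("\n"):
        lines[-1] += "\n"
    with open(dst_path, "w", encoding=encoding) as dst:
        dst.writelines(reversed(lines))
    return len(lines)
```

In the notebook this could be called as `reverse_lines(file_name_plaintext, file_name_plaintext_rev)`.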

In [65]:
# Check that both the regular and the reversed-order plaintext tweet files
#   exist and are the same size

!ls -al tweets_plaintext_*

-rw------- 1 root root 148261 Dec  5 14:59 tweets_plaintext_20201205-145558.txt
-rw------- 1 root root 148263 Dec  8 23:33 tweets_plaintext_20201208-233325.txt
-rw------- 1 root root 148263 Dec  8 23:37 tweets_plaintext_rev_20201208-233326.txt


In [67]:
# Check that the first 5 lines of the regular plaintext tweetfile...

!head -n 5 $file_name_plaintext

award accepted on behalf of the great people at tesla spacex neuralink &amp boring co
we are still far from simply video in control out the biggest game changer currently underway at tesla is degree high fps video for labeling training &amp inference
wow
tesla is a vehicle for creating &amp producing many useful products
because i am not an investor tesla is definitely not the only good company but investing is not what i do but i always put my own money into companies i help create otherwise it would be wrong to ask others to do so


In [68]:
# ...match the last 5 lines of the reversed plaintext tweetfile

!tail -n 5 $file_name_plaintext_rev

because i am not an investor tesla is definitely not the only good company but investing is not what i do but i always put my own money into companies i help create otherwise it would be wrong to ask others to do so
tesla is a vehicle for creating &amp producing many useful products
wow
we are still far from simply video in control out the biggest game changer currently underway at tesla is degree high fps video for labeling training &amp inference
award accepted on behalf of the great people at tesla spacex neuralink &amp boring co


In [69]:
# Check that the last 5 lines of the regular plaintext tweetfile...

!tail -n 5 $file_name_plaintext

that is not just paranoia a healthy trait at times tesla really is under massive attack by short sellers
will communicate better in the future too many people want us to fail and are willing to twist any bit of news against tesla
a tesla roadster just passed the mile mark for the first time and still has over miles of range
the exec conf room at tesla used to be called denali but i decided to move a few letters around seemed more apt
hacked my tesla charge connector on a small island in the rain last night


In [70]:
# ...match the first 5 lines of the reversed plaintext tweetfile

!head -n 5 $file_name_plaintext_rev

hacked my tesla charge connector on a small island in the rain last night
the exec conf room at tesla used to be called denali but i decided to move a few letters around seemed more apt
a tesla roadster just passed the mile mark for the first time and still has over miles of range
will communicate better in the future too many people want us to fail and are willing to twist any bit of news against tesla
that is not just paranoia a healthy trait at times tesla really is under massive attack by short sellers
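
The head/tail spot checks above compare only the ends of the two files. A complete check can be sketched in Python by comparing every line; `is_exact_reverse` is a hypothetical helper, and it assumes both files fit in memory.

```python
def is_exact_reverse(path_a: str, path_b: str, encoding: str = "utf-8") -> bool:
    """Hypothetical helper: True if path_b holds the lines of path_a in reverse order."""
    with open(path_a, encoding=encoding) as fa:
        lines_a = [line.rstrip("\n") for line in fa]
    with open(path_b, encoding=encoding) as fb:
        lines_b = [line.rstrip("\n") for line in fb]
    return lines_a == lines_b[::-1]
```

In the notebook this could be called as `is_exact_reverse(file_name_plaintext, file_name_plaintext_rev)`.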


In [None]:
!head -n 10 $file_name_plaintext

award accepted on behalf of the great people at tesla spacex neuralink &amp boring co
we are still far from simply video in control out the biggest game changer currently underway at tesla is degree high fps video for labeling training &amp inference
wow
tesla is a vehicle for creating &amp producing many useful products
because i am not an investor tesla is definitely not the only good company but investing is not what i do but i always put my own money into companies i help create otherwise it would be wrong to ask others to do so
andrej is awesome but it should be said that we have a very talented autopilot ai team at tesla too much credit comes to me &amp andrej
safety is our primary design goal
a lot of my brain space is spent dealing with both units
the only publicly traded stock i own is tesla
tesla holiday software release is


In [80]:
# If the previous four checks confirm that the reversed file is correct,
#   copy it into the working directory

# !gsutil -q -m cp $file_name_plaintext_rev $dir_working

# NOTE: the *_rev file will be in the ./twint/ subdirectory

!cp $file_name_plaintext_rev $dir_working # "/content/drive/My Drive/"

In [76]:
# Optional: download the plaintext tweet file to your local machine
#   (pass file_name_plaintext_rev instead to download the reversed file)

print(file_name_plaintext)

from google.colab import files
files.download(file_name_plaintext) 

tweets_plaintext_20201208-233325.txt

