# **Twitter Step 2: Parse and Clean Tweets**
By: Jon Chun
30 Nov 2020

* Parse tweets into components (e.g. hashtags, emojis)
* Clean the main text of the tweets (e.g. lowercase, remove punctuation)

Reference:

* https://www.freecodecamp.org/news/how-to-combine-multiple-csv-files-with-8-lines-of-code-265183e0854/

# **0. Setup Environment**

## You will need to give permission for this Colab to link to your gdrive in the code cell below

In [1]:
from google.colab import drive
drive.mount('/gdrive')
%cd /gdrive

Mounted at /gdrive
/gdrive


In [2]:
# CUSTOMIZE: if you want your work and twitter datasets saved into a specific folder
#            beneath your gdrive root directory, define it below

%cd ./MyDrive/courses/2020f_iphs200_programming_humanity/code/

/gdrive/MyDrive/courses/2020f_iphs200_programming_humanity/code


In [178]:
!ls *.csv

cleaned_tweets_combined_20201201-012404.csv
tweets_combined_20201201-012404.csv
tweets_seattle_all.csv
tweets_twint_donald_trump__20201201-012233.csv
tweets_twint_election_win__20201201-012219.csv
tweets_twint_future__20201201-012121.csv
tweets_twint_vote_court__20201201-012309.csv


In [4]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

In [47]:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)  # None = never truncate; -1 is deprecated and raises a FutureWarning


In [18]:
import os
import re
import glob


In [147]:
!pip install contractions

Collecting contractions
  Downloading https://files.pythonhosted.org/packages/00/92/a05b76a692ac08d470ae5c23873cf1c9a041532f1ee065e74b374f218306/contractions-0.0.25-py2.py3-none-any.whl
Collecting textsearch
  Downloading https://files.pythonhosted.org/packages/42/a8/03407021f9555043de5492a2bd7a35c56cc03c2510092b5ec018cae1bbf1/textsearch-0.0.17-py2.py3-none-any.whl
Collecting Unidecode
  Downloading https://files.pythonhosted.org/packages/d0/42/d9edfed04228bacea2d824904cae367ee9efd05e6cce7ceaaedd0b0ad964/Unidecode-1.1.1-py2.py3-none-any.whl (238kB)
Collecting pyahocorasick
  Downloading https://files.pythonhosted.org/packages/f4/9f/f0d8e8850e12829eea2e778f1c90e3c53a9a799b7f412082a5d21cd19ae1/pyahocorasick-1.4.0.tar.gz (312kB)
Building wheels for collected packages: pyahocorasick
  Building wheel for pyahocorasick (setup.py) ... done

In [148]:
import contractions

# **1. Read in Combined Tweet Dataset File**

In [179]:
!ls *.csv

cleaned_tweets_combined_20201201-012404.csv
tweets_combined_20201201-012404.csv
tweets_seattle_all.csv
tweets_twint_donald_trump__20201201-012233.csv
tweets_twint_election_win__20201201-012219.csv
tweets_twint_future__20201201-012121.csv
tweets_twint_vote_court__20201201-012309.csv


In [7]:
# CONFIGURE: Set the 'file_name_all' to the name of the combined datafile with all the tweets
#            which should be listed in the previous code cell

file_name_all = 'tweets_combined_20201201-012404.csv'

In [83]:
combined_df = pd.read_csv(file_name_all, encoding='utf-8')
combined_df = combined_df.convert_dtypes()
combined_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1060 entries, 0 to 1059
Data columns (total 36 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   id               1060 non-null   Int64  
 1   conversation_id  1060 non-null   Int64  
 2   created_at       1060 non-null   string 
 3   date             1060 non-null   string 
 4   time             1060 non-null   string 
 5   timezone         1060 non-null   Int64  
 6   user_id          1060 non-null   Int64  
 7   username         1060 non-null   string 
 8   name             1060 non-null   string 
 9   place            0 non-null      Int64  
 10  tweet            1060 non-null   string 
 11  language         1060 non-null   string 
 12  mentions         1060 non-null   string 
 13  urls             1060 non-null   string 
 14  photos           1060 non-null   string 
 15  replies_count    1060 non-null   Int64  
 16  retweets_count   1060 non-null   Int64  
 17  likes_count   

In [84]:
combined_df.shape

(1060, 36)

In [180]:
combined_df.head()

Unnamed: 0,id,conversation_id,created_at,date,time,timezone,user_id,username,name,place,tweet,language,mentions,urls,photos,replies_count,retweets_count,likes_count,hashtags,cashtags,link,retweet,quote_url,video,thumbnail,near,geo,source,user_rt_id,user_rt,retweet_id,reply_to,retweet_date,translate,trans_src,trans_dest,emojis,smileys,numbers,tweet_clean
0,1332356759994970113,1332356759994970113,2020-11-27 16:13:00+00:00,2020-11-27,16:13:00,0,939091,joebiden,Joe Biden,,"This Native American Heritage Day, we give thanks to our Indigenous communities and their ancestors. As we celebrate their rich heritage and contributions, let’s commit to writing a new future together — one built on a strong partnership and filled with opportunity for all.",en,[],[],[],6351,21601,248367,[],[],https://twitter.com/JoeBiden/status/1332356759994970113,False,,0,,,,,,,,[],,,,,[],[],[],this native american heritage day we give thanks to our indigenous communities and their ancestors as we celebrate their rich heritage and contributions let us commit to writing a new future together one built on a strong partnership and filled with opportunity for all
1,1323473727447654400,1323473727447654400,2020-11-03 03:55:00+00:00,2020-11-03,03:55:00,0,939091,joebiden,Joe Biden,,I’ve said it many times: I’m more optimistic about America’s future today than I was when I got elected to the United States Senate as a 29-year-old.,en,[],[],[],4359,5375,79312,[],[],https://twitter.com/JoeBiden/status/1323473727447654400,False,,0,,,,,,,,[],,,,,[],[],[ 29],i have said it many times i am more optimistic about americas future today than i was when i got elected to the united states senate as a year old
2,1323391493885718528,1323391493885718528,2020-11-02 22:28:14+00:00,2020-11-02,22:28:14,0,939091,joebiden,Joe Biden,,I’m speaking with members of the African American community in Pittsburgh about the power of the vote — and the future we can build together. Tune in. https://t.co/1wFBiLoCWu,en,[],[https://t.co/1wFBiLoCWu],[],1818,1994,13823,[],[],https://twitter.com/JoeBiden/status/1323391493885718528,False,,0,,,,,,,,[],,,,,[],[],[],i am speaking with members of the african american community in pittsburgh about the power of the vote and the future we can build together tune in
3,1322959086544211973,1322959086544211973,2020-11-01 17:50:00+00:00,2020-11-01,17:50:00,0,939091,joebiden,Joe Biden,,We can build a future where: - Health care is a right - We end the gun violence epidemic - We combat climate change - Our government works for everyone Vote. https://t.co/eoxT07d7QB,en,[],[https://t.co/eoxT07d7QB],[],1603,2581,18087,[],[],https://twitter.com/JoeBiden/status/1322959086544211973,False,,0,,,,,,,,[],,,,,[],[],[],we can build a future where health care is a right we end the gun violence epidemic we combat climate change our government works for everyone vote
4,1322927509260902401,1322927509260902401,2020-11-01 15:44:31+00:00,2020-11-01,15:44:31,0,939091,joebiden,Joe Biden,,The future of our planet is on the ballot. Vote: https://t.co/eoxT07d7QB,en,[],[https://t.co/eoxT07d7QB],[],2155,4513,43222,[],[],https://twitter.com/JoeBiden/status/1322927509260902401,False,,0,,,,,,,,[],,,,,[],[],[],the future of our planet is on the ballot vote


# **2. Parse Tweets into Components**

Your class assignment this semester had you clean tweets manually to reinforce your understanding of Python, RegEx and NLP. For the final class project, I encourage you to use a text preprocessing library such as 'preprocessor', illustrated below, so you can focus on analysis and interpretation.

Unfortunately, the 'preprocessor' library is relatively new and has no written documentation as of Nov 2020 (see: https://preprocessor.readthedocs.io/en/latest/). The code blocks below show the key functionality you may want to use, worked out by experimenting with the library and reading its source code.

References:

* https://github.com/s/preprocessor
* https://towardsdatascience.com/basic-tweet-preprocessing-in-python-efd8360d529e 

## Python Library to clean tweets: preprocessor

Cleans tweets with customizable filters
* Input: string
* Output: string

Ref: https://github.com/s/preprocessor

```
p.set_options(p.OPT.URL, p.OPT.EMOJI)
p.clean('Preprocessor is #awesome 👍 https://github.com/s/preprocessor')
'Preprocessor is #awesome'
```

Options are:
```
Option Name	Option Short Code
URL	p.OPT.URL
Mention	p.OPT.MENTION
Hashtag	p.OPT.HASHTAG
Reserved Words	p.OPT.RESERVED
Emoji	p.OPT.EMOJI
Smiley	p.OPT.SMILEY
Number	p.OPT.NUMBER
```

The next few code blocks will show you how the library 'preprocessor' can clean, parse and tokenize tweets

* More info at: https://github.com/s/preprocessor
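
To make the option mechanism concrete, here is a minimal stand-alone sketch of the idea, assuming a handful of simplified regular expressions. The `PATTERNS` table and `clean_with_options` are hypothetical names used only for illustration; this is not how 'preprocessor' is implemented, and its real patterns are more thorough.

```python
import re

# Hypothetical, simplified patterns (assumptions, not the library's own)
PATTERNS = {
    "URL": re.compile(r"https?://\S+"),
    "MENTION": re.compile(r"@\w+"),
    "HASHTAG": re.compile(r"#\w+"),
    "EMOJI": re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]"),
    "NUMBER": re.compile(r"\b\d+\b"),
}

def clean_with_options(text, options=None):
    """Remove every token type named in options; None means remove all types."""
    selected = PATTERNS.keys() if options is None else options
    for name in selected:
        text = PATTERNS[name].sub("", text)
    # collapse the gaps left behind by the removed tokens
    return " ".join(text.split())

sample = "Preprocessor is #awesome 👍 https://github.com/s/preprocessor"

# only URLs and emojis are listed, so the hashtag passes through
only_url_emoji = clean_with_options(sample, ["URL", "EMOJI"])
print(only_url_emoji)   # Preprocessor is #awesome

# no options given, so every token type is removed
everything = clean_with_options(sample)
print(everything)       # Preprocessor is
```

The point of the sketch is the selection rule: a token type is removed only if it is listed, and listing nothing removes them all.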

In [15]:
!pip install tweet-preprocessor

Collecting tweet-preprocessor
  Downloading https://files.pythonhosted.org/packages/17/9d/71bd016a9edcef8860c607e531f30bd09b13103c7951ae73dd2bf174163c/tweet_preprocessor-0.6.0-py3-none-any.whl
Installing collected packages: tweet-preprocessor
Successfully installed tweet-preprocessor-0.6.0


In [16]:
import preprocessor as p

In [121]:
# uses tweet-preprocessor (imported above as 'p') to clean tweets
# https://towardsdatascience.com/twitter-sentiment-analysis-nlp-text-analytics-b7b296d71fce
# https://github.com/importdata/Twitter-Sentiment-Analysis/blob/master/Twitter_Sentiment_Analysis_Support_Vector_Classifier.ipynb

# punctuation to delete outright (raw strings keep the regex backslashes intact)
REPLACE_NO_SPACE = re.compile(r"(\.)|(\;)|(\:)|(\!)|(\')|(\?)|(\,)|(\")|(\|)|(\()|(\))|(\[)|(\])|(\%)|(\$)|(\>)|(\<)|(\{)|(\})")
# punctuation and markup to replace with a single space
REPLACE_WITH_SPACE = re.compile(r"(<br\s/><br\s/?)|(-)|(/)|(:).")

def tweet_nopunct(astr):
  """Return astr run through p.clean(), lowercased, and stripped of punctuation and markup tags"""
  astr_clean = p.clean(astr)
  # lowercase, then delete punctuation
  astr_clean = REPLACE_NO_SPACE.sub("", astr_clean.lower())
  # replace remaining separators (e.g. '-' and '/') with a space
  astr_clean = REPLACE_WITH_SPACE.sub(" ", astr_clean)
  return astr_clean
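
The two-pass punctuation logic can be tried without the library. The sketch below is an illustration under assumptions: `NO_SPACE` rewrites the same punctuation set as `REPLACE_NO_SPACE` as a single character class, `WITH_SPACE` models only the hyphen and slash cases of `REPLACE_WITH_SPACE`, and `strip_punct` is a hypothetical helper that skips the `p.clean()` step.

```python
import re

# Same punctuation set as REPLACE_NO_SPACE, written as one character class
NO_SPACE = re.compile(r"""[.;:!'?,"|()\[\]%$><{}]""")
# Only the '-' and '/' cases of REPLACE_WITH_SPACE are modeled here
WITH_SPACE = re.compile(r"[-/]")

def strip_punct(text):
    """Lowercase, delete punctuation, turn separators into spaces, collapse whitespace."""
    text = NO_SPACE.sub("", text.lower())
    text = WITH_SPACE.sub(" ", text)
    return " ".join(text.split())

result = strip_punct("Vote: we can't wait - health/care is a right!")
print(result)   # vote we cant wait health care is a right
```

Note that "can't" becomes "cant" here, which is why contractions are better expanded before punctuation is removed.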

In [20]:
# Test cleaning a tweet
 
# use 'p.set_options()' to filter out different types of tokens (e.g. URL, EMOJI, etc)
# if p.set_options() not called, clean will filter out everything but plain text
# if p.set_options() called, any p.OPT.x listed will be filtered out and unmentioned OPT will pass thru

# p.set_options(p.OPT.URL)
tweet_wpunct_str = p.clean('Preprocessor! is #awesome 👍 https://github.com/s/preprocessor')
print(tweet_wpunct_str)

Preprocessor! is


In [181]:
# Test cleaning a tweet with filters to remove tags and punctuation

tweet_wopunct_test = tweet_nopunct('Preprocessor! is #awesome 👍 https://github.com/s/preprocessor')
print(tweet_wopunct_test)

preprocessor is


In [27]:
def parseitem2list(api):
  """ Convert a preprocessor 'ParseItem' var into a Python list var """
  # each parsed item exposes its matched text as '.match'
  return [val.match for val in api]

In [28]:
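# Test
# NOTE: 'tweet_urls_pi' and 'parsed_tweet_pi' (next cell) are not defined in the cells shown here;
#       assumed setup, for some sample tweet string 'atweet':
#         parsed_tweet_pi = p.parse(atweet)
#         tweet_urls_pi = parsed_tweet_pi.urls

# convert our urls ParseItem to a standard Python list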

tweet_urls_ls = parseitem2list(tweet_urls_pi)
print(tweet_urls_ls)

['http://bigfoot.ai', 'https://github.com/s/preprocessor']


In [29]:
# Test
parsed_tweet_pi.urls

[(9:26) => http://bigfoot.ai, (62:95) => https://github.com/s/preprocessor]
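
The `(start:end) => match` output above suggests each parsed item records where a match sits and what text matched. As a rough sketch of that idea using only the standard library (the `Span` tuple, `URL_PATTERN` and `find_urls` are hypothetical stand-ins, not the library's classes), the same kind of record can be built with `re.finditer` and then flattened into a plain list the way `parseitem2list` does:

```python
import re
from collections import namedtuple

# Hypothetical stand-in for a parsed item: position of the match plus its text
Span = namedtuple("Span", ["start", "end", "match"])

URL_PATTERN = re.compile(r"https?://\S+")  # simplified, illustrative pattern

def find_urls(text):
    """Return one Span per URL found in text, in order of appearance."""
    return [Span(m.start(), m.end(), m.group(0)) for m in URL_PATTERN.finditer(text)]

spans = find_urls("@bigfoot http://bigfoot.ai says hi")
print(spans)    # [Span(start=9, end=26, match='http://bigfoot.ai')]

# keep only the matched text, the same idea as parseitem2list
urls = [span.match for span in spans]
print(urls)     # ['http://bigfoot.ai']
```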

In [34]:
def parse_tweet(tweet_str):
  """Parse the text of a tweet into sub-components and store in dict"""
  
  parsed_tweet_pi = p.parse(tweet_str)

  def parseitem2list(api):
    """ Convert a preprocessor 'ParseItem' var into a Python list var """
    alist = []

    for i, val in enumerate(api):
      alist.append(val.match)
    
    return alist  

  # convert our urls ParseItem to a standard Python list
  if (parsed_tweet_pi.urls):
    tweet_urls_ls = parseitem2list(parsed_tweet_pi.urls)
  else:
    tweet_urls_ls = []

  # convert our hashtags ParseItem to a standard Python list
  if (parsed_tweet_pi.hashtags):
    tweet_hashtags_ls = parseitem2list(parsed_tweet_pi.hashtags)
  else:
    tweet_hashtags_ls = []

  # convert our mentions ParseItem to a standard Python list
  if (parsed_tweet_pi.mentions):
    tweet_mentions_ls = parseitem2list(parsed_tweet_pi.mentions)
  else:
    tweet_mentions_ls = []

  # convert our emojis ParseItem to a standard Python list
  if (parsed_tweet_pi.emojis):
    tweet_emojis_ls = parseitem2list(parsed_tweet_pi.emojis)
  else:
    tweet_emojis_ls = []

  # convert our smileys ParseItem to a standard Python list
  if (parsed_tweet_pi.smileys):
    tweet_smileys_ls = parseitem2list(parsed_tweet_pi.smileys)
  else:
    tweet_smileys_ls = []

  # convert our numbers ParseItem to a standard Python list
  if (parsed_tweet_pi.numbers):
    tweet_numbers_ls = parseitem2list(parsed_tweet_pi.numbers)
  else:
    tweet_numbers_ls = []


  tweet_dt = {'urls': tweet_urls_ls, 
              'hashtags': tweet_hashtags_ls,
              'mentions': tweet_mentions_ls,
              'emojis': tweet_emojis_ls,
              'smileys': tweet_smileys_ls,
              'numbers': tweet_numbers_ls}

  return tweet_dt

In [50]:
# Test
atweet = '@bigfoot :o http://bigfoot.ai says FAV Preprocessor ;/ is #awesome 👍 https://github.com/s/preprocessor RT if you like this #kickarse'

parse_tweet(atweet) # ['emojis']

{'emojis': ['👍'],
 'hashtags': ['#awesome', '#kickarse'],
 'mentions': ['@bigfoot'],
 'numbers': [],
 'smileys': [':o', ';/ '],
 'urls': ['http://bigfoot.ai', 'https://github.com/s/preprocessor']}

In [92]:
# Test
atweet = '@littlehand ;< http://littlehand.ai says FAV Preprocessor >:/ is #terrible 👍 https://github.com/s/preprocessor RT if you hate this #sucks'

parse_tweet(atweet)

{'emojis': ['👍'],
 'hashtags': ['#terrible', '#sucks'],
 'mentions': ['@littlehand'],
 'numbers': [],
 'smileys': [':/ '],
 'urls': ['http://littlehand.ai', 'https://github.com/s/preprocessor']}

In [93]:
combined_df.iloc[12]['tweet']

"I couldn't be more excited to have my friend @BarackObama hitting the campaign trail to talk about what's at stake in this election.  As he said in Philadelphia, we can't just imagine a better future — we have to fight for it and vote like never before:  https://t.co/eoxT07d7QB  https://t.co/vxQ9eE9k9k"

In [90]:
parse_tweet(combined_df.iloc[12]['tweet'])

{'emojis': [],
 'hashtags': [],
 'mentions': ['@BarackObama'],
 'numbers': [],
 'smileys': [],
 'urls': ['https://t.co/eoxT07d7QB', 'https://t.co/vxQ9eE9k9k']}

In [64]:
parse_tweet(combined_df.iloc[12]['tweet'])['hashtags']

[]

In [94]:
# Split out identifiable tweet subcomponents into separate columns before cleaning them further below

combined_df['hashtags'] = combined_df['tweet'].apply(lambda astr : parse_tweet(astr)['hashtags'])
combined_df['mentions'] = combined_df['tweet'].apply(lambda astr : parse_tweet(astr)['mentions'])
combined_df['urls'] = combined_df['tweet'].apply(lambda astr : parse_tweet(astr)['urls'])


In [95]:
# Split out identifiable tweet subcomponents into separate columns before cleaning them further below

combined_df['emojis'] = combined_df['tweet'].apply(lambda astr : parse_tweet(astr)['emojis'])
combined_df['smileys'] = combined_df['tweet'].apply(lambda astr : parse_tweet(astr)['smileys'])
combined_df['numbers'] = combined_df['tweet'].apply(lambda astr : parse_tweet(astr)['numbers'])
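
The two cells above call `parse_tweet` six times per tweet, once for each column. A possible alternative is to parse each tweet once and fill every column from that single result. The sketch below shows the idea on plain dicts, assuming a hypothetical regex-based `parse_components` in place of `parse_tweet`; in pandas, one way to do the same would be to call `combined_df['tweet'].apply(parse_tweet)` once and read each column from the resulting Series of dicts.

```python
import re

def parse_components(text):
    """Hypothetical stand-in for parse_tweet(): one dict of lists per tweet."""
    return {
        "hashtags": re.findall(r"#\w+", text),
        "mentions": re.findall(r"@\w+", text),
        "urls": re.findall(r"https?://\S+", text),
    }

rows = [
    {"tweet": "Vote! #future @BarackObama https://t.co/eoxT07d7QB"},
    {"tweet": "The future of our planet is on the ballot."},
]

for row in rows:
    # parse once, then add every component as its own column
    row.update(parse_components(row["tweet"]))

print(rows[0]["hashtags"], rows[0]["mentions"], rows[0]["urls"])
print(rows[1]["hashtags"])
```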

In [68]:
combined_df['tweet'][10:20]

10    The future of our country is on the ballot — and you get to decide what it looks like.  Vote:  https://t.co/eoxT07d7QB                                                                                                                                                                                         
11    The future of our planet is on the ballot.  https://t.co/fWtwYOmycR                                                                                                                                                                                                                                            
12    I couldn't be more excited to have my friend @BarackObama hitting the campaign trail to talk about what's at stake in this election.  As he said in Philadelphia, we can't just imagine a better future — we have to fight for it and vote like never before:  https://t.co/eoxT07d7QB  https://t.co/vxQ9eE9k9k
13    Imagine a day in the not too distant future when you can enjoy a

In [96]:
combined_df[10:15]

Unnamed: 0,id,conversation_id,created_at,date,time,timezone,user_id,username,name,place,tweet,language,mentions,urls,photos,replies_count,retweets_count,likes_count,hashtags,cashtags,link,retweet,quote_url,video,thumbnail,near,geo,source,user_rt_id,user_rt,retweet_id,reply_to,retweet_date,translate,trans_src,trans_dest,emojis,smileys,numbers
10,1321926461054423040,1321926461054423040,2020-10-29 21:26:43+00:00,2020-10-29,21:26:43,0,939091,joebiden,Joe Biden,,The future of our country is on the ballot — and you get to decide what it looks like. Vote: https://t.co/eoxT07d7QB,en,[],[https://t.co/eoxT07d7QB],[],3442,5423,45838,[],[],https://twitter.com/JoeBiden/status/1321926461054423040,False,,0,,,,,,,,[],,,,,[],[],[]
11,1321600635222052866,1321600635222052866,2020-10-28 23:52:00+00:00,2020-10-28,23:52:00,0,939091,joebiden,Joe Biden,,The future of our planet is on the ballot. https://t.co/fWtwYOmycR,en,[],[https://t.co/fWtwYOmycR],[],2538,5280,25978,[],[],https://twitter.com/JoeBiden/status/1321600635222052866,False,,1,https://pbs.twimg.com/media/EldEJKQXEAEN15c.jpg,,,,,,,[],,,,,[],[],[]
12,1320850190325194754,1320850190325194754,2020-10-26 22:10:00+00:00,2020-10-26,22:10:00,0,939091,joebiden,Joe Biden,,"I couldn't be more excited to have my friend @BarackObama hitting the campaign trail to talk about what's at stake in this election. As he said in Philadelphia, we can't just imagine a better future — we have to fight for it and vote like never before: https://t.co/eoxT07d7QB https://t.co/vxQ9eE9k9k",en,[@BarackObama],"[https://t.co/eoxT07d7QB, https://t.co/vxQ9eE9k9k]",[],2903,6084,35530,[],[],https://twitter.com/JoeBiden/status/1320850190325194754,False,,1,https://pbs.twimg.com/media/ElSYRAlXYAINBio.jpg,,,,,,,[],,,,,[],[],[]
13,1320122899655634944,1320122899655634944,2020-10-24 22:00:00+00:00,2020-10-24,22:00:00,0,939091,joebiden,Joe Biden,,"Imagine a day in the not too distant future when you can enjoy a dinner out with your friends, a night at the movies, or when you can celebrate your birthday, wedding, or graduation surrounded by your nearest and dearest. We can get there — together. https://t.co/uVRpnIrirz",en,[],[https://t.co/uVRpnIrirz],[],6833,6484,43530,[],[],https://twitter.com/JoeBiden/status/1320122899655634944,False,,1,https://pbs.twimg.com/media/ElIDflmWkAUueZo.jpg,,,,,,,[],,,,,[],[],[]
14,1318352230458609664,1318352230458609664,2020-10-20 00:44:00+00:00,2020-10-20,00:44:00,0,939091,joebiden,Joe Biden,,I will be a president who pushes towards the future. Not one who clings to the past.,en,[],[],[],29956,33151,350740,[],[],https://twitter.com/JoeBiden/status/1318352230458609664,False,,0,,,,,,,,[],,,,,[],[],[]


In [97]:
# Convert columns to more specific dtype

combined_df = combined_df.convert_dtypes() # astype({'tweet':'str'})
combined_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1060 entries, 0 to 1059
Data columns (total 39 columns):
 #   Column           Non-Null Count  Dtype              
---  ------           --------------  -----              
 0   id               1060 non-null   Int64              
 1   conversation_id  1060 non-null   Int64              
 2   created_at       1060 non-null   datetime64[ns, UTC]
 3   date             1060 non-null   string             
 4   time             1060 non-null   string             
 5   timezone         1060 non-null   Int64              
 6   user_id          1060 non-null   Int64              
 7   username         1060 non-null   string             
 8   name             1060 non-null   string             
 9   place            0 non-null      Int64              
 10  tweet            1060 non-null   string             
 11  language         1060 non-null   string             
 12  mentions         1060 non-null   object             
 13  urls             1

In [87]:
# test

pd.to_datetime('2020-11-08')

Timestamp('2020-11-08 00:00:00')

In [98]:
# Make sure created_at is a timezone-aware datetime column
# (errors='coerce' turns unparseable values into NaT instead of silently leaving the column unconverted)
combined_df['created_at'] = pd.to_datetime(combined_df['created_at'], errors='coerce', utc=True)
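
To see what a parsed `created_at` value looks like, the following sketch uses only the standard library rather than pandas (an assumption made so it runs anywhere). It parses the timestamp string of the first tweet and compares it with a bare date like the one in the test cell above.

```python
from datetime import datetime, timedelta

# 'created_at' value of the first tweet in the dataset
stamp = datetime.fromisoformat("2020-11-27 16:13:00+00:00")
# a bare date, as in the pd.to_datetime('2020-11-08') test
date_only = datetime.fromisoformat("2020-11-08")

is_utc = stamp.utcoffset() == timedelta(0)   # offset +00:00 means UTC
is_naive = date_only.tzinfo is None          # no offset given, so no timezone

print(stamp, is_utc)        # 2020-11-27 16:13:00+00:00 True
print(date_only, is_naive)  # 2020-11-08 00:00:00 True
```

The first value carries a UTC offset, so the result is timezone-aware, which mirrors the `datetime64[ns, UTC]` dtype reported by `info()`; a bare date yields a naive timestamp.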

In [99]:
# df['Date']= pd.to_datetime(df['Date'])
# combined_df['tweet_dt'] = pd.to_datetime(combined_df['tweet_dt'], errors='coerce')

In [100]:
combined_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1060 entries, 0 to 1059
Data columns (total 39 columns):
 #   Column           Non-Null Count  Dtype              
---  ------           --------------  -----              
 0   id               1060 non-null   Int64              
 1   conversation_id  1060 non-null   Int64              
 2   created_at       1060 non-null   datetime64[ns, UTC]
 3   date             1060 non-null   string             
 4   time             1060 non-null   string             
 5   timezone         1060 non-null   Int64              
 6   user_id          1060 non-null   Int64              
 7   username         1060 non-null   string             
 8   name             1060 non-null   string             
 9   place            0 non-null      Int64              
 10  tweet            1060 non-null   string             
 11  language         1060 non-null   string             
 12  mentions         1060 non-null   object             
 13  urls             1

In [101]:
# Show any rows in which every column is null (an empty DataFrame means there are none)
print(combined_df[combined_df.isna().all(axis=1)])

Empty DataFrame
Columns: [id, conversation_id, created_at, date, time, timezone, user_id, username, name, place, tweet, language, mentions, urls, photos, replies_count, retweets_count, likes_count, hashtags, cashtags, link, retweet, quote_url, video, thumbnail, near, geo, source, user_rt_id, user_rt, retweet_id, reply_to, retweet_date, translate, trans_src, trans_dest, emojis, smileys, numbers]
Index: []


In [102]:
# drop any completely null tweets

combined_df.dropna(how='all', axis=0, inplace=True)
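
The difference between "every value in the row is null" (`.all(axis=1)`, `how='all'`) and "at least one value in the row is null" (`.any(axis=1)`) is easy to mix up. Here is a small sketch of the two tests on plain Python lists, with `None` standing in for a missing value; the `rows` data is made up for illustration.

```python
rows = [
    [1, "hello", None],    # one missing value
    [None, None, None],    # completely empty row
    [2, "world", "x"],     # nothing missing
]

# row indexes where every value is missing (what dropna(how='all') removes)
all_null = [i for i, row in enumerate(rows) if all(v is None for v in row)]
# row indexes where at least one value is missing
any_null = [i for i, row in enumerate(rows) if any(v is None for v in row)]
# keep every row that is not completely empty
kept = [row for row in rows if not all(v is None for v in row)]

print(all_null)   # [1]
print(any_null)   # [0, 1]
print(len(kept))  # 2
```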

In [103]:
combined_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1060 entries, 0 to 1059
Data columns (total 39 columns):
 #   Column           Non-Null Count  Dtype              
---  ------           --------------  -----              
 0   id               1060 non-null   Int64              
 1   conversation_id  1060 non-null   Int64              
 2   created_at       1060 non-null   datetime64[ns, UTC]
 3   date             1060 non-null   string             
 4   time             1060 non-null   string             
 5   timezone         1060 non-null   Int64              
 6   user_id          1060 non-null   Int64              
 7   username         1060 non-null   string             
 8   name             1060 non-null   string             
 9   place            0 non-null      Int64              
 10  tweet            1060 non-null   string             
 11  language         1060 non-null   string             
 12  mentions         1060 non-null   object             
 13  urls             1

In [None]:
# show rows that contain at least one null value

print(combined_df[combined_df.isna().any(axis=1)])



In [104]:
combined_df.head()

Unnamed: 0,id,conversation_id,created_at,date,time,timezone,user_id,username,name,place,tweet,language,mentions,urls,photos,replies_count,retweets_count,likes_count,hashtags,cashtags,link,retweet,quote_url,video,thumbnail,near,geo,source,user_rt_id,user_rt,retweet_id,reply_to,retweet_date,translate,trans_src,trans_dest,emojis,smileys,numbers
0,1332356759994970113,1332356759994970113,2020-11-27 16:13:00+00:00,2020-11-27,16:13:00,0,939091,joebiden,Joe Biden,,"This Native American Heritage Day, we give thanks to our Indigenous communities and their ancestors. As we celebrate their rich heritage and contributions, let’s commit to writing a new future together — one built on a strong partnership and filled with opportunity for all.",en,[],[],[],6351,21601,248367,[],[],https://twitter.com/JoeBiden/status/1332356759994970113,False,,0,,,,,,,,[],,,,,[],[],[]
1,1323473727447654400,1323473727447654400,2020-11-03 03:55:00+00:00,2020-11-03,03:55:00,0,939091,joebiden,Joe Biden,,I’ve said it many times: I’m more optimistic about America’s future today than I was when I got elected to the United States Senate as a 29-year-old.,en,[],[],[],4359,5375,79312,[],[],https://twitter.com/JoeBiden/status/1323473727447654400,False,,0,,,,,,,,[],,,,,[],[],[ 29]
2,1323391493885718528,1323391493885718528,2020-11-02 22:28:14+00:00,2020-11-02,22:28:14,0,939091,joebiden,Joe Biden,,I’m speaking with members of the African American community in Pittsburgh about the power of the vote — and the future we can build together. Tune in. https://t.co/1wFBiLoCWu,en,[],[https://t.co/1wFBiLoCWu],[],1818,1994,13823,[],[],https://twitter.com/JoeBiden/status/1323391493885718528,False,,0,,,,,,,,[],,,,,[],[],[]
3,1322959086544211973,1322959086544211973,2020-11-01 17:50:00+00:00,2020-11-01,17:50:00,0,939091,joebiden,Joe Biden,,We can build a future where: - Health care is a right - We end the gun violence epidemic - We combat climate change - Our government works for everyone Vote. https://t.co/eoxT07d7QB,en,[],[https://t.co/eoxT07d7QB],[],1603,2581,18087,[],[],https://twitter.com/JoeBiden/status/1322959086544211973,False,,0,,,,,,,,[],,,,,[],[],[]
4,1322927509260902401,1322927509260902401,2020-11-01 15:44:31+00:00,2020-11-01,15:44:31,0,939091,joebiden,Joe Biden,,The future of our planet is on the ballot. Vote: https://t.co/eoxT07d7QB,en,[],[https://t.co/eoxT07d7QB],[],2155,4513,43222,[],[],https://twitter.com/JoeBiden/status/1322927509260902401,False,,0,,,,,,,,[],,,,,[],[],[]


In [183]:
# Create clean tweet from original tweet text

# expand contractions (e.g. "I'm" -> "I am") while the apostrophes are still present
combined_df['tweet_clean'] = combined_df['tweet'].apply(contractions.fix)

# strip URLs, mentions, hashtags, emojis, etc. via p.clean(), lowercase, and remove punctuation
combined_df['tweet_clean'] = combined_df['tweet_clean'].apply(tweet_nopunct)

# collapse multiple-whitespaces to one whitespace
combined_df['tweet_clean'] = combined_df['tweet_clean'].apply(lambda astr : ' '.join(astr.split()))
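
The order of the first two cleaning steps matters: once apostrophes are stripped, a contraction can no longer be recognized. The sketch below illustrates this with a tiny, hypothetical contraction map (`CONTRACTIONS`) and helper functions that stand in for `contractions.fix` and the punctuation removal; the real library covers far more cases.

```python
import re

# Hypothetical three-entry map, for illustration only
CONTRACTIONS = {"can't": "cannot", "i'm": "i am", "let's": "let us"}
PATTERN = re.compile("|".join(re.escape(key) for key in CONTRACTIONS))

def expand(text):
    """Replace each known contraction with its expanded form."""
    return PATTERN.sub(lambda m: CONTRACTIONS[m.group(0)], text)

def strip_apostrophes(text):
    """Stand-in for punctuation removal: drop apostrophes only."""
    return text.replace("'", "")

raw = "i'm sure we can't wait"

expand_first = strip_apostrophes(expand(raw))
strip_first = expand(strip_apostrophes(raw))

print(expand_first)  # i am sure we cannot wait
print(strip_first)   # im sure we cant wait
```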

In [150]:
combined_df['tweet_clean'][:40]

0     this native american heritage day we give thanks to our indigenous communities and their ancestors as we celebrate their rich heritage and contributions let us commit to writing a new future together one built on a strong partnership and filled with opportunity for all      
1     i have said it many times i am more optimistic about americas future today than i was when i got elected to the united states senate as a year old                                                                                                                                 
2     i am speaking with members of the african american community in pittsburgh about the power of the vote and the future we can build together tune in                                                                                                                                
3     we can build a future where health care is a right we end the gun violence epidemic we combat climate change our government works for everyone vote 

In [184]:
combined_df.head()

Unnamed: 0,id,conversation_id,created_at,date,time,timezone,user_id,username,name,place,tweet,language,mentions,urls,photos,replies_count,retweets_count,likes_count,hashtags,cashtags,link,retweet,quote_url,video,thumbnail,near,geo,source,user_rt_id,user_rt,retweet_id,reply_to,retweet_date,translate,trans_src,trans_dest,emojis,smileys,numbers,tweet_clean
0,1332356759994970113,1332356759994970113,2020-11-27 16:13:00+00:00,2020-11-27,16:13:00,0,939091,joebiden,Joe Biden,,"This Native American Heritage Day, we give thanks to our Indigenous communities and their ancestors. As we celebrate their rich heritage and contributions, let’s commit to writing a new future together — one built on a strong partnership and filled with opportunity for all.",en,[],[],[],6351,21601,248367,[],[],https://twitter.com/JoeBiden/status/1332356759994970113,False,,0,,,,,,,,[],,,,,[],[],[],this native american heritage day we give thanks to our indigenous communities and their ancestors as we celebrate their rich heritage and contributions let us commit to writing a new future together one built on a strong partnership and filled with opportunity for all
1,1323473727447654400,1323473727447654400,2020-11-03 03:55:00+00:00,2020-11-03,03:55:00,0,939091,joebiden,Joe Biden,,I’ve said it many times: I’m more optimistic about America’s future today than I was when I got elected to the United States Senate as a 29-year-old.,en,[],[],[],4359,5375,79312,[],[],https://twitter.com/JoeBiden/status/1323473727447654400,False,,0,,,,,,,,[],,,,,[],[],[ 29],i have said it many times i am more optimistic about americas future today than i was when i got elected to the united states senate as a year old
2,1323391493885718528,1323391493885718528,2020-11-02 22:28:14+00:00,2020-11-02,22:28:14,0,939091,joebiden,Joe Biden,,I’m speaking with members of the African American community in Pittsburgh about the power of the vote — and the future we can build together. Tune in. https://t.co/1wFBiLoCWu,en,[],[https://t.co/1wFBiLoCWu],[],1818,1994,13823,[],[],https://twitter.com/JoeBiden/status/1323391493885718528,False,,0,,,,,,,,[],,,,,[],[],[],i am speaking with members of the african american community in pittsburgh about the power of the vote and the future we can build together tune in
3,1322959086544211973,1322959086544211973,2020-11-01 17:50:00+00:00,2020-11-01,17:50:00,0,939091,joebiden,Joe Biden,,We can build a future where: - Health care is a right - We end the gun violence epidemic - We combat climate change - Our government works for everyone Vote. https://t.co/eoxT07d7QB,en,[],[https://t.co/eoxT07d7QB],[],1603,2581,18087,[],[],https://twitter.com/JoeBiden/status/1322959086544211973,False,,0,,,,,,,,[],,,,,[],[],[],we can build a future where health care is a right we end the gun violence epidemic we combat climate change our government works for everyone vote
4,1322927509260902401,1322927509260902401,2020-11-01 15:44:31+00:00,2020-11-01,15:44:31,0,939091,joebiden,Joe Biden,,The future of our planet is on the ballot. Vote: https://t.co/eoxT07d7QB,en,[],[https://t.co/eoxT07d7QB],[],2155,4513,43222,[],[],https://twitter.com/JoeBiden/status/1322927509260902401,False,,0,,,,,,,,[],,,,,[],[],[],the future of our planet is on the ballot vote


# **3. Write Parsed & Cleaned Tweets to datafile.csv**

In [185]:
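# Build the output filename by prefixing the input filename with 'cleaned_'
# (the datetime stamp is inherited from the input filename)

file_name_cleaned = 'cleaned_' + os.path.splitext(file_name_all)[0] + '.csv' # 'cleaned_tweets_combined_20201201-012404.csv'
print(file_name_cleaned)

# the DataFrame index is written as the first, unnamed column
combined_df.to_csv(file_name_cleaned)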

cleaned_tweets_combined_20201201-012404.csv
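
As a sketch of the naming scheme, assuming the stamp in the filenames follows the pattern `YYYYMMDD-HHMMSS`, the two hypothetical helpers below derive a 'cleaned_' name from an existing filename and build a freshly stamped name from a given datetime.

```python
import os
from datetime import datetime

def cleaned_name(source_name):
    """Prefix a filename with 'cleaned_', keeping its stem and extension."""
    stem, ext = os.path.splitext(source_name)
    return "cleaned_" + stem + ext

def timestamped_name(prefix, when):
    """Build '<prefix>_<YYYYMMDD-HHMMSS>.csv' from a datetime."""
    return prefix + "_" + when.strftime("%Y%m%d-%H%M%S") + ".csv"

print(cleaned_name("tweets_combined_20201201-012404.csv"))
# cleaned_tweets_combined_20201201-012404.csv

print(timestamped_name("tweets_combined", datetime(2020, 12, 1, 1, 24, 4)))
# tweets_combined_20201201-012404.csv
```

Passing `datetime.now()` as `when` would give a name that is unique to the moment the cell runs.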


In [186]:
!ls -al cleaned_*

-rw------- 1 root root 707199 Dec  1 06:20 cleaned_tweets_combined_20201201-012404.csv
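
One thing to keep in mind when this file is read back later: a CSV cell can only hold text, so list columns such as hashtags come back as strings (this is why `mentions` and `urls` showed up as `string` right after `read_csv` earlier). The sketch below demonstrates the round trip with the standard `csv` module on made-up data and restores the list with `ast.literal_eval`; with pandas, the `converters` argument of `read_csv` could apply the same function per column.

```python
import ast
import csv
import io

# made-up row with two list-valued columns
original = {"tweet": "Vote! #future", "hashtags": ["#future"], "urls": []}

# write the row to an in-memory CSV; lists are stored as their text form
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["tweet", "hashtags", "urls"])
writer.writerow([original["tweet"], str(original["hashtags"]), str(original["urls"])])

# read it back: every cell is now a plain string
buffer.seek(0)
row = next(csv.DictReader(buffer))
as_read = row["hashtags"]

# turn the text form back into a real list
restored = ast.literal_eval(as_read)

print(repr(as_read))   # "['#future']"
print(restored)        # ['#future']
```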


In [187]:
!head -5 $file_name_cleaned

,id,conversation_id,created_at,date,time,timezone,user_id,username,name,place,tweet,language,mentions,urls,photos,replies_count,retweets_count,likes_count,hashtags,cashtags,link,retweet,quote_url,video,thumbnail,near,geo,source,user_rt_id,user_rt,retweet_id,reply_to,retweet_date,translate,trans_src,trans_dest,emojis,smileys,numbers,tweet_clean
0,1332356759994970113,1332356759994970113,2020-11-27 16:13:00+00:00,2020-11-27,16:13:00,0,939091,joebiden,Joe Biden,,"This Native American Heritage Day, we give thanks to our Indigenous communities and their ancestors. As we celebrate their rich heritage and contributions, let’s commit to writing a new future together — one built on a strong partnership and filled with opportunity for all.",en,[],[],[],6351,21601,248367,[],[],https://twitter.com/JoeBiden/status/1332356759994970113,False,,0,,,,,,,,[],,,,,[],[],[],this native american heritage day we give thanks to our indigenous communities and their ancestors as we celebrate their rich heritage a