<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Instructions" data-toc-modified-id="Instructions-0">Instructions</a></span><ul class="toc-item"><li><span><a href="#Get-data-from-kaggle.com" data-toc-modified-id="Get-data-from-kaggle.com-0.1">Get data from kaggle.com</a></span></li><li><span><a href="#Load-a-dataframe" data-toc-modified-id="Load-a-dataframe-0.2">Load a dataframe</a></span></li><li><span><a href="#Basic-pre-processing" data-toc-modified-id="Basic-pre-processing-0.3">Basic pre-processing</a></span></li></ul></li></ul></div>

## Instructions

1. Load this data set from Kaggle: `kaggle datasets download -d gpreda/pfizer-vaccine-tweets`
2. Determine the shape of the dataframe
3. Review the data types
4. Drop the id column
5. Check for null values
6. Perform the following pre-processing on the 'text' column.
    - (new column1) change all text to lowercase
    - (new column2) use new column1 and expand contractions
    - (new column3) use new column2 and join the words back into a single string
    - (new column4) use new column3 and tokenize into sentences
    - (new column5) use new column3, again, and tokenize into words
    - (new column6) use new column5 and remove special characters
    - (new column7) use new column6 and remove stop words
    - (new column8) use new column7 and perform stemming
    - (new column9) use new column8 and perform lemmatization
    - add columns for tweet length and tweet word count
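
The step 6 sequence can be sketched on a single string before applying it to a whole column. This is a minimal, standard-library-only illustration: `CONTRACTIONS` and `STOP_WORDS` are tiny hypothetical stand-ins for the `contractions` package and NLTK's stop-word list, the tokenizer is a crude whitespace split, and `preprocess` is a name made up for this sketch.

```python
import string

# Hypothetical stand-ins: the notebook uses the `contractions` package and
# NLTK's English stop-word list; tiny hand-written versions keep this runnable.
CONTRACTIONS = {"you're": "you are", "don't": "do not", "it's": "it is"}
STOP_WORDS = {"a", "are", "do", "is", "it", "not", "the", "to", "when", "you"}


def preprocess(text):
    """Apply the step-6 sequence to one tweet and return each stage."""
    lower = text.lower()                                        # column 1
    expanded = [CONTRACTIONS.get(w, w) for w in lower.split()]  # column 2
    joined = " ".join(expanded)                                 # column 3
    # Crude word tokenizer: split on whitespace, strip punctuation at the edges.
    words = [w.strip(string.punctuation) for w in joined.split()]
    words = [w for w in words if w]                             # columns 5-6
    no_stop = [w for w in words if w not in STOP_WORDS]         # column 7
    return {
        "lower": lower,
        "joined": joined,
        "words": words,
        "no_stopwords": no_stop,
        "length": len(text),              # tweet length in characters
        "word_count": len(text.split()),  # whitespace-delimited word count
    }


result = preprocess("Facts are immutable, Senator, even when you're wrong.")
print(result["no_stopwords"])
```

Sentence tokenization, stemming, and lemmatization are left out here because they depend on NLTK.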

### Get data from kaggle.com

In [1]:
#from google.colab import drive
#drive.mount('/content/drive')

In [2]:
#from google.colab import files

## Upload your kaggle json file (API Token)
#files.upload()

#!mkdir ~/.kaggle

#!cp kaggle.json ~/.kaggle/

#!chmod 600 ~/.kaggle/kaggle.json

In [3]:
#!kaggle datasets download -d gpreda/pfizer-vaccine-tweets

In [4]:
#!ls

In [5]:
#!mkdir data

#!unzip <zip file name> -d data

In [6]:
#!ls -l data/
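
If shell commands are not available, the `mkdir` and `unzip` steps above could also be done from Python with the standard library. This is a sketch only: `extract_archive` is a hypothetical helper, and the archive name is whatever file the Kaggle download produced.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir="data"):
    """Extract every member of zip_path into dest_dir and return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)  # like `mkdir data`, but safe to re-run
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)             # like `unzip <zip file name> -d data`
        return archive.namelist()
```

Calling `extract_archive` with the downloaded zip file's path returns the extracted file names, which serves the same purpose as the `ls -l data/` check.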

### Load a dataframe

In [7]:
# Imports
import pandas as pd

# What other imports are required?
#!pip install contractions
#!pip install pyspellchecker

import contractions
import string
#import re

import nltk
#nltk.download('punkt')
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize

#nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

#nltk.download('averaged_perceptron_tagger')
#nltk.download('wordnet')
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

In [28]:
pfz = pd.read_csv('/Users/jimcody/Documents/2021Python/nlp/data/vaccination_tweets.csv')
pfz.head()

Unnamed: 0,id,user_name,user_location,user_description,user_created,user_followers,user_friends,user_favourites,user_verified,date,text,hashtags,source,retweets,favorites,is_retweet
0,1340539111971516416,Rachel Roh,"La Crescenta-Montrose, CA",Aggregator of Asian American news; scanning di...,2009-04-08 17:52:46,405,1692,3247,False,2020-12-20 06:06:44,Same folks said daikon paste could treat a cyt...,['PfizerBioNTech'],Twitter for Android,0,0,False
1,1338158543359250433,Albert Fong,"San Francisco, CA","Marketing dude, tech geek, heavy metal & '80s ...",2009-09-21 15:27:30,834,666,178,False,2020-12-13 16:27:13,While the world has been on the wrong side of ...,,Twitter Web App,1,1,False
2,1337858199140118533,eli🇱🇹🇪🇺👌,Your Bed,"heil, hydra 🖐☺",2020-06-25 23:30:28,10,88,155,False,2020-12-12 20:33:45,#coronavirus #SputnikV #AstraZeneca #PfizerBio...,"['coronavirus', 'SputnikV', 'AstraZeneca', 'Pf...",Twitter for Android,0,0,False
3,1337855739918835717,Charles Adler,"Vancouver, BC - Canada","Hosting ""CharlesAdlerTonight"" Global News Radi...",2008-09-10 11:28:53,49165,3933,21853,True,2020-12-12 20:23:59,"Facts are immutable, Senator, even when you're...",,Twitter Web App,446,2129,False
4,1337854064604966912,Citizen News Channel,,Citizen News Channel bringing you an alternati...,2020-04-23 17:58:42,152,580,1473,False,2020-12-12 20:17:19,Explain to me again why we need a vaccine @Bor...,"['whereareallthesickpeople', 'PfizerBioNTech']",Twitter for iPhone,0,0,False


In [29]:
pfz.shape

(11003, 16)

In [30]:
pfz.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11003 entries, 0 to 11002
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   id                11003 non-null  int64 
 1   user_name         11003 non-null  object
 2   user_location     8734 non-null   object
 3   user_description  10323 non-null  object
 4   user_created      11003 non-null  object
 5   user_followers    11003 non-null  int64 
 6   user_friends      11003 non-null  int64 
 7   user_favourites   11003 non-null  int64 
 8   user_verified     11003 non-null  bool  
 9   date              11003 non-null  object
 10  text              11003 non-null  object
 11  hashtags          8426 non-null   object
 12  source            11002 non-null  object
 13  retweets          11003 non-null  int64 
 14  favorites         11003 non-null  int64 
 15  is_retweet        11003 non-null  bool  
dtypes: bool(2), int64(6), object(8)
memory usage: 1.2+ MB
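
The non-null counts above already imply the missing values (for example, `user_location` has 11003 - 8734 = 2269 nulls), but step 5 can also be done explicitly. The sketch below uses a small made-up frame named `toy` in place of `pfz`, so the values are illustrative only.

```python
import pandas as pd

# Toy frame standing in for pfz; the values are made up for illustration.
toy = pd.DataFrame({
    "text": ["first tweet", "second tweet", "third tweet"],
    "user_location": ["Vancouver", None, None],
    "hashtags": [None, "['PfizerBioNTech']", "['coronavirus']"],
})

null_counts = toy.isna().sum()       # number of nulls in each column
print(null_counts[null_counts > 0])  # show only the columns that have nulls
```

The same two lines applied to `pfz` would list `user_location`, `user_description`, `hashtags`, and `source` as the columns with missing values.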


### Basic pre-processing

In [32]:
# Drop columns that are not needed for the text analysis
drop_columns = ['id']
pfz = pfz.drop(columns=drop_columns)
#pfz.shape

In [33]:
# Change text to lowercase
pfz['lower'] = pfz['text'].str.lower()
#pfz.head()

In [34]:
# Expand contractions word by word (e.g. "you're" -> "you are")
pfz['remove_ctr'] = pfz['lower'].apply(lambda x: [contractions.fix(word) for word in x.split()])
#pfz.head()

In [35]:
# Join the expanded words in remove_ctr back into a single string
pfz["review_new"] = [' '.join(map(str, l)) for l in pfz['remove_ctr']]
pfz.head()

Unnamed: 0,user_name,user_location,user_description,user_created,user_followers,user_friends,user_favourites,user_verified,date,text,hashtags,source,retweets,favorites,is_retweet,lower,remove_ctr,review_new
0,Rachel Roh,"La Crescenta-Montrose, CA",Aggregator of Asian American news; scanning di...,2009-04-08 17:52:46,405,1692,3247,False,2020-12-20 06:06:44,Same folks said daikon paste could treat a cyt...,['PfizerBioNTech'],Twitter for Android,0,0,False,same folks said daikon paste could treat a cyt...,"[same, folks, said, daikon, paste, could, trea...",same folks said daikon paste could treat a cyt...
1,Albert Fong,"San Francisco, CA","Marketing dude, tech geek, heavy metal & '80s ...",2009-09-21 15:27:30,834,666,178,False,2020-12-13 16:27:13,While the world has been on the wrong side of ...,,Twitter Web App,1,1,False,while the world has been on the wrong side of ...,"[while, the, world, has, been, on, the, wrong,...",while the world has been on the wrong side of ...
2,eli🇱🇹🇪🇺👌,Your Bed,"heil, hydra 🖐☺",2020-06-25 23:30:28,10,88,155,False,2020-12-12 20:33:45,#coronavirus #SputnikV #AstraZeneca #PfizerBio...,"['coronavirus', 'SputnikV', 'AstraZeneca', 'Pf...",Twitter for Android,0,0,False,#coronavirus #sputnikv #astrazeneca #pfizerbio...,"[#coronavirus, #sputnikv, #astrazeneca, #pfize...",#coronavirus #sputnikv #astrazeneca #pfizerbio...
3,Charles Adler,"Vancouver, BC - Canada","Hosting ""CharlesAdlerTonight"" Global News Radi...",2008-09-10 11:28:53,49165,3933,21853,True,2020-12-12 20:23:59,"Facts are immutable, Senator, even when you're...",,Twitter Web App,446,2129,False,"facts are immutable, senator, even when you're...","[facts, are, immutable,, senator,, even, when,...","facts are immutable, senator, even when you ar..."
4,Citizen News Channel,,Citizen News Channel bringing you an alternati...,2020-04-23 17:58:42,152,580,1473,False,2020-12-12 20:17:19,Explain to me again why we need a vaccine @Bor...,"['whereareallthesickpeople', 'PfizerBioNTech']",Twitter for iPhone,0,0,False,explain to me again why we need a vaccine @bor...,"[explain, to, me, again, why, we, need, a, vac...",explain to me again why we need a vaccine @bor...


In [36]:
# Create tokenized sentences
pfz['tokenized_sent'] = pfz['review_new'].apply(sent_tokenize)
#pfz.head()

In [37]:
# Create tokenized words
pfz['tokenized_word'] = pfz['review_new'].apply(word_tokenize)
#pfz.head()

In [17]:
print(string.punctuation)  # String module

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [38]:
# Remove punctuation tokens using string.punctuation from the string module
punc = string.punctuation
pfz['no_punc'] = pfz['tokenized_word'].apply(lambda x: [word for word in x if word not in punc])
#pfz.head()
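
One subtlety in the cell above: `punc` is a string, so `word not in punc` is a substring test rather than a per-character check. Multi-character punctuation tokens such as `...` are kept because they are not substrings of `string.punctuation`. The sketch below contrasts the two approaches on a hand-made token list; `is_punct_token` is a hypothetical helper, not part of the notebook.

```python
import string

PUNCT_CHARS = set(string.punctuation)


def is_punct_token(token):
    """True when the token is non-empty and made only of punctuation characters."""
    return bool(token) and all(ch in PUNCT_CHARS for ch in token)


tokens = ["facts", ",", "#", "...", "``", "you", "()"]

# Substring test, as in the cell above: "..." and "``" survive.
kept_substring = [t for t in tokens if t not in string.punctuation]

# Character-wise test: every all-punctuation token is dropped.
kept_charwise = [t for t in tokens if not is_punct_token(t)]

print(kept_substring)
print(kept_charwise)
```

Whether the stricter check is wanted depends on the analysis; it is shown here only to make the difference visible.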

In [39]:
len(stop_words)

179

In [40]:
print(stop_words)

{'down', 'the', 've', 'these', 'by', 'was', 'up', "hadn't", 'shan', 'wouldn', 'needn', "should've", 'ours', 'himself', 'shouldn', 'what', 'and', 'have', 'than', "haven't", "you've", 'which', 'when', "doesn't", 'is', 'his', 'herself', 'your', 'those', 'themselves', 'won', "isn't", "you'll", 'i', 'it', 'while', "wasn't", 'having', 'she', 'didn', 't', 'to', 'isn', 'them', 'both', 'but', 'will', 'too', 'some', 'couldn', 's', 'haven', 'at', 'don', 'weren', "weren't", 'her', 'you', 'few', 'do', 'such', 'all', "aren't", 'only', 'against', 'or', 'myself', 'ain', 'they', 'hasn', 'my', 'are', 'out', 'over', 'of', 'between', 'he', 'yours', 'not', 'as', 'their', "won't", 'did', 'so', 'about', 'an', 'most', 'any', "you're", 'y', 'after', "mightn't", 'on', "shouldn't", "shan't", 'with', 'm', 'him', 'am', 'in', "needn't", 'because', 'until', 'yourselves', 'mightn', 'further', 'there', 'into', 'before', 'whom', "she's", 'its', 'itself', 'our', 'below', 'can', "hasn't", 'we', 'being', "wouldn't", 're',

In [41]:
# Remove stop words
pfz['no_stopwords'] = pfz['no_punc'].apply(lambda x: [word for word in x if word not in stop_words])
#pfz.head()
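
The instructions ask for a stemming column (new column8), which the cells shown here do not build. One possible sketch uses NLTK's `PorterStemmer`, which needs no extra corpus download. The helper `stem_tokens` and the column name `stemmed` are assumptions made for this example.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()


def stem_tokens(tokens):
    """Return the Porter stem of every token in the list."""
    return [stemmer.stem(tok) for tok in tokens]


print(stem_tokens(["folks", "running", "vaccines"]))

# Applied to the dataframe (hypothetical column name):
# pfz['stemmed'] = pfz['no_stopwords'].apply(stem_tokens)
```

Stems are not always dictionary words, which is one reason the notebook goes on to lemmatize with part-of-speech information.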

In [20]:
# Why part-of-speech tags matter: "fly" is a verb in "We are going to fly to Europe."
# but a noun in "There is a fly on the wall."

In [42]:
# Tag each token with its Penn Treebank part of speech
pfz['pos_tags'] = pfz['no_stopwords'].apply(nltk.tag.pos_tag)
#pfz.head()

In [43]:
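def get_wordnet_pos(tag):
    """Map a Penn Treebank tag to the WordNet POS constant the lemmatizer expects."""
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        # Anything else (modals, prepositions, ...) falls back to noun
        return wordnet.NOUN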
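
The same mapping can be written as a lookup table, which makes it easy to try without NLTK installed. This sketch assumes the WordNet constants are the single letters `'a'`, `'v'`, `'n'`, and `'r'`, as the `wordnet_pos` column in the output further down suggests. `to_wordnet_pos` is a hypothetical name, and the last pair in `tagged` is an illustrative example rather than notebook output.

```python
# Single-letter codes assumed to match wordnet.ADJ / VERB / NOUN / ADV.
PREFIX_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}


def to_wordnet_pos(tag, default="n"):
    """Map a Penn Treebank tag to a WordNet POS code using its first letter."""
    return PREFIX_TO_WORDNET.get(tag[:1], default)


tagged = [
    ("folks", "NNS"),
    ("said", "VBD"),
    ("daikon", "JJ"),
    ("explain", "RB"),
    ("could", "MD"),  # illustrative: a tag outside J/V/N/R falls back to noun
]
converted = [(word, to_wordnet_pos(tag)) for word, tag in tagged]
print(converted)
```

The noun fallback mirrors the `else` branch of `get_wordnet_pos`.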

In [44]:
# Convert each Penn Treebank tag to its WordNet equivalent
pfz['wordnet_pos'] = pfz['pos_tags'].apply(lambda x: [(word, get_wordnet_pos(pos_tag)) for (word, pos_tag) in x])
#pfz.head()

In [45]:
# Lemmatize each word using its WordNet part of speech
wnl = WordNetLemmatizer()
pfz['lemmatized'] = pfz['wordnet_pos'].apply(lambda x: [wnl.lemmatize(word, tag) for word, tag in x])
pfz.head()

Unnamed: 0,user_name,user_location,user_description,user_created,user_followers,user_friends,user_favourites,user_verified,date,text,...,lower,remove_ctr,review_new,tokenized_sent,tokenized_word,no_punc,no_stopwords,pos_tags,wordnet_pos,lemmatized
0,Rachel Roh,"La Crescenta-Montrose, CA",Aggregator of Asian American news; scanning di...,2009-04-08 17:52:46,405,1692,3247,False,2020-12-20 06:06:44,Same folks said daikon paste could treat a cyt...,...,same folks said daikon paste could treat a cyt...,"[same, folks, said, daikon, paste, could, trea...",same folks said daikon paste could treat a cyt...,[same folks said daikon paste could treat a cy...,"[same, folks, said, daikon, paste, could, trea...","[same, folks, said, daikon, paste, could, trea...","[folks, said, daikon, paste, could, treat, cyt...","[(folks, NNS), (said, VBD), (daikon, JJ), (pas...","[(folks, n), (said, v), (daikon, a), (paste, n...","[folk, say, daikon, paste, could, treat, cytok..."
1,Albert Fong,"San Francisco, CA","Marketing dude, tech geek, heavy metal & '80s ...",2009-09-21 15:27:30,834,666,178,False,2020-12-13 16:27:13,While the world has been on the wrong side of ...,...,while the world has been on the wrong side of ...,"[while, the, world, has, been, on, the, wrong,...",while the world has been on the wrong side of ...,[while the world has been on the wrong side of...,"[while, the, world, has, been, on, the, wrong,...","[while, the, world, has, been, on, the, wrong,...","[world, wrong, side, history, year, hopefully,...","[(world, NN), (wrong, JJ), (side, NN), (histor...","[(world, n), (wrong, a), (side, n), (history, ...","[world, wrong, side, history, year, hopefully,..."
2,eli🇱🇹🇪🇺👌,Your Bed,"heil, hydra 🖐☺",2020-06-25 23:30:28,10,88,155,False,2020-12-12 20:33:45,#coronavirus #SputnikV #AstraZeneca #PfizerBio...,...,#coronavirus #sputnikv #astrazeneca #pfizerbio...,"[#coronavirus, #sputnikv, #astrazeneca, #pfize...",#coronavirus #sputnikv #astrazeneca #pfizerbio...,[#coronavirus #sputnikv #astrazeneca #pfizerbi...,"[#, coronavirus, #, sputnikv, #, astrazeneca, ...","[coronavirus, sputnikv, astrazeneca, pfizerbio...","[coronavirus, sputnikv, astrazeneca, pfizerbio...","[(coronavirus, NN), (sputnikv, NN), (astrazene...","[(coronavirus, n), (sputnikv, n), (astrazeneca...","[coronavirus, sputnikv, astrazeneca, pfizerbio..."
3,Charles Adler,"Vancouver, BC - Canada","Hosting ""CharlesAdlerTonight"" Global News Radi...",2008-09-10 11:28:53,49165,3933,21853,True,2020-12-12 20:23:59,"Facts are immutable, Senator, even when you're...",...,"facts are immutable, senator, even when you're...","[facts, are, immutable,, senator,, even, when,...","facts are immutable, senator, even when you ar...","[facts are immutable, senator, even when you a...","[facts, are, immutable, ,, senator, ,, even, w...","[facts, are, immutable, senator, even, when, y...","[facts, immutable, senator, even, ethically, s...","[(facts, NNS), (immutable, JJ), (senator, NN),...","[(facts, n), (immutable, a), (senator, n), (ev...","[fact, immutable, senator, even, ethically, st..."
4,Citizen News Channel,,Citizen News Channel bringing you an alternati...,2020-04-23 17:58:42,152,580,1473,False,2020-12-12 20:17:19,Explain to me again why we need a vaccine @Bor...,...,explain to me again why we need a vaccine @bor...,"[explain, to, me, again, why, we, need, a, vac...",explain to me again why we need a vaccine @bor...,[explain to me again why we need a vaccine @bo...,"[explain, to, me, again, why, we, need, a, vac...","[explain, to, me, again, why, we, need, a, vac...","[explain, need, vaccine, borisjohnson, matthan...","[(explain, RB), (need, JJ), (vaccine, NN), (bo...","[(explain, r), (need, a), (vaccine, n), (boris...","[explain, need, vaccine, borisjohnson, matthan..."


In [46]:
# Character length and whitespace-delimited word count of the original tweet text
pfz['review_len'] = pfz['text'].astype(str).apply(len)
pfz['word_count'] = pfz['text'].apply(lambda x: len(str(x).split()))
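
Both measures are taken from the raw `text` column, so hashtags, mentions, and punctuation all count. A small sketch on a plain string shows what each number means; `tweet_stats` is a hypothetical helper and the sample sentence is shortened from one of the tweets above.

```python
def tweet_stats(text):
    """Return (character length, whitespace-delimited word count) for one tweet."""
    text = str(text)  # guards against non-string values such as NaN
    return len(text), len(text.split())


sample = "Explain to me again why we need a vaccine"
length, words = tweet_stats(sample)
print(length, words)
```

In pandas the same values can also be computed without `apply`, using `pfz['text'].str.len()` and `pfz['text'].str.split().str.len()`.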

In [26]:
pfz.head()

Unnamed: 0,user_name,user_location,user_description,user_created,user_followers,user_friends,user_favourites,user_verified,date,text,...,review_new,tokenized_sent,tokenized_word,no_punc,no_stopwords,pos_tags,wordnet_pos,lemmatized,review_len,word_count
0,Rachel Roh,"La Crescenta-Montrose, CA",Aggregator of Asian American news; scanning di...,2009-04-08 17:52:46,405,1692,3247,False,2020-12-20 06:06:44,Same folks said daikon paste could treat a cyt...,...,same folks said daikon paste could treat a cyt...,[same folks said daikon paste could treat a cy...,"[same, folks, said, daikon, paste, could, trea...","[same, folks, said, daikon, paste, could, trea...","[folks, said, daikon, paste, could, treat, cyt...","[(folks, NNS), (said, VBD), (daikon, JJ), (pas...","[(folks, n), (said, v), (daikon, a), (paste, n...","[folk, say, daikon, paste, could, treat, cytok...",97,12
1,Albert Fong,"San Francisco, CA","Marketing dude, tech geek, heavy metal & '80s ...",2009-09-21 15:27:30,834,666,178,False,2020-12-13 16:27:13,While the world has been on the wrong side of ...,...,while the world has been on the wrong side of ...,[while the world has been on the wrong side of...,"[while, the, world, has, been, on, the, wrong,...","[while, the, world, has, been, on, the, wrong,...","[world, wrong, side, history, year, hopefully,...","[(world, NN), (wrong, JJ), (side, NN), (histor...","[(world, n), (wrong, a), (side, n), (history, ...","[world, wrong, side, history, year, hopefully,...",140,21
2,eli🇱🇹🇪🇺👌,Your Bed,"heil, hydra 🖐☺",2020-06-25 23:30:28,10,88,155,False,2020-12-12 20:33:45,#coronavirus #SputnikV #AstraZeneca #PfizerBio...,...,#coronavirus #sputnikv #astrazeneca #pfizerbio...,[#coronavirus #sputnikv #astrazeneca #pfizerbi...,"[#, coronavirus, #, sputnikv, #, astrazeneca, ...","[coronavirus, sputnikv, astrazeneca, pfizerbio...","[coronavirus, sputnikv, astrazeneca, pfizerbio...","[(coronavirus, NN), (sputnikv, NN), (astrazene...","[(coronavirus, n), (sputnikv, n), (astrazeneca...","[coronavirus, sputnikv, astrazeneca, pfizerbio...",140,15
3,Charles Adler,"Vancouver, BC - Canada","Hosting ""CharlesAdlerTonight"" Global News Radi...",2008-09-10 11:28:53,49165,3933,21853,True,2020-12-12 20:23:59,"Facts are immutable, Senator, even when you're...",...,"facts are immutable, senator, even when you ar...","[facts are immutable, senator, even when you a...","[facts, are, immutable, ,, senator, ,, even, w...","[facts, are, immutable, senator, even, when, y...","[facts, immutable, senator, even, ethically, s...","[(facts, NNS), (immutable, JJ), (senator, NN),...","[(facts, n), (immutable, a), (senator, n), (ev...","[fact, immutable, senator, even, ethically, st...",140,20
4,Citizen News Channel,,Citizen News Channel bringing you an alternati...,2020-04-23 17:58:42,152,580,1473,False,2020-12-12 20:17:19,Explain to me again why we need a vaccine @Bor...,...,explain to me again why we need a vaccine @bor...,[explain to me again why we need a vaccine @bo...,"[explain, to, me, again, why, we, need, a, vac...","[explain, to, me, again, why, we, need, a, vac...","[explain, need, vaccine, borisjohnson, matthan...","[(explain, RB), (need, JJ), (vaccine, NN), (bo...","[(explain, r), (need, a), (vaccine, n), (boris...","[explain, need, vaccine, borisjohnson, matthan...",135,14


In [27]:
pfz.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11003 entries, 0 to 11002
Data columns (total 27 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   user_name         11003 non-null  object
 1   user_location     8734 non-null   object
 2   user_description  10323 non-null  object
 3   user_created      11003 non-null  object
 4   user_followers    11003 non-null  int64 
 5   user_friends      11003 non-null  int64 
 6   user_favourites   11003 non-null  int64 
 7   user_verified     11003 non-null  bool  
 8   date              11003 non-null  object
 9   text              11003 non-null  object
 10  hashtags          8426 non-null   object
 11  source            11002 non-null  object
 12  retweets          11003 non-null  int64 
 13  favorites         11003 non-null  int64 
 14  is_retweet        11003 non-null  bool  
 15  lower             11003 non-null  object
 16  remove_ctr        11003 non-null  object
 17  review_new  