<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Instructions" data-toc-modified-id="Instructions-0">Instructions</a></span><ul class="toc-item"><li><span><a href="#Get-data-from-kaggle.com" data-toc-modified-id="Get-data-from-kaggle.com-0.1">Get data from kaggle.com</a></span></li><li><span><a href="#Load-a-dataframe" data-toc-modified-id="Load-a-dataframe-0.2">Load a dataframe</a></span></li><li><span><a href="#Basic-pre-processing" data-toc-modified-id="Basic-pre-processing-0.3">Basic pre-processing</a></span></li></ul></li></ul></div>

## Instructions

1. Download this dataset from Kaggle: `kaggle datasets download -d gpreda/pfizer-vaccine-tweets`
2. Determine the shape of the dataframe
3. Review the data types
4. Drop the id column
5. Check for null values
6. Perform the following pre-processing on the 'text' column. 
    - (new column1) change all text to lowercase
    - (new column2) use new column1 and expand contractions
    - (new column3) use new column2 and join the words back into a single string
    - (new column4) use new column3 and tokenize into sentences
    - (new column5) use new column3, again, and tokenize into words
    - (new column6) use new column5 and remove special characters
    - (new column7) use new column6 and remove stop words
    - (new column8) use new column7 and perform stemming
    - (new column9) use new column8 and perform lemmatization
    - add columns for tweet length and tweet word count

### Get data from kaggle.com

In [None]:
#from google.colab import drive
#drive.mount('/content/drive')

In [None]:
#from google.colab import files

## Upload your kaggle json file (API Token)
#files.upload()

#!mkdir ~/.kaggle

#!cp kaggle.json ~/.kaggle/

#!chmod 600 ~/.kaggle/kaggle.json

In [None]:
#!kaggle datasets download -d gpreda/pfizer-vaccine-tweets

In [None]:
#!ls

In [None]:
#!mkdir data

#!unzip <zip file name> -d data

In [None]:
#!ls -l data/
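
Where the `unzip` command is not available, Python's standard `zipfile` module can perform the same extraction. The following is a minimal sketch under that assumption; the helper name `extract_archive` and the archive name in the usage comment are placeholders, not names taken from this notebook.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir="data"):
    """Extract every member of zip_path into dest_dir and return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)  # creates the folder if it is missing
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return archive.namelist()


# Hypothetical usage: substitute the archive name that `!ls` reports.
# extract_archive("<zip file name>", "data")
```

Returning the member names makes it easy to confirm which CSV file landed in `data/` before loading it.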

### Load a dataframe

In [None]:
# Imports
import pandas as pd

# What other imports are required?
#!pip install contractions
#!pip install pyspellchecker

import contractions
import string
import re

import nltk
#nltk.download('punkt')
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize

#nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

#nltk.download('averaged_perceptron_tagger')
#nltk.download('wordnet')
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

In [None]:
pfz = pd.read_csv('/Users/jimcody/Documents/2021Python/nlp/data/vaccination_tweets.csv')
#pfz.head()

In [None]:
#pfz.shape

In [None]:
#pfz.info()
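
Steps 2, 3, and 5 of the instructions (shape, data types, null values) can each be answered with one pandas call. The sketch below runs them on a small made-up dataframe, so the values are illustrative only; on the real data the same calls would be applied to `pfz`.

```python
import pandas as pd

# Toy stand-in for the tweets dataframe; the values are made up for illustration.
sample = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "user_location": ["Vancouver, BC", None, "San Francisco, CA"],
        "text": ["First tweet", "Second tweet", "Third tweet"],
    }
)

n_rows, n_cols = sample.shape      # step 2: shape as (rows, columns)
dtypes = sample.dtypes             # step 3: one dtype per column
null_counts = sample.isna().sum()  # step 5: missing values per column

print(n_rows, n_cols)
print(dtypes)
print(null_counts)
```

`info()` reports non-null counts as well, but `isna().sum()` gives the number of missing values per column directly.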

### Basic pre-processing

In [None]:
# Drop columns (a list keeps the column order explicit; a set is unordered)
drop_columns = ['id']
pfz = pfz.drop(columns=drop_columns)
#pfz.shape

In [None]:
# Change text to lowercase
pfz['lower'] = pfz['text'].str.lower()
#pfz.head()

In [None]:
# Expand contractions word by word (each row becomes a list of strings)
pfz['remove_ctr'] = pfz['lower'].apply(lambda x: [contractions.fix(word) for word in x.split()])
#pfz.head()

In [None]:
# Join the expanded words in remove_ctr back into a single string
pfz["review_new"] = [' '.join(map(str, words)) for words in pfz['remove_ctr']]
#pfz.head()

In [None]:
# Create tokenized sentences
pfz['tokenized_sent'] = pfz['review_new'].apply(sent_tokenize)
#pfz.head()

In [None]:
# Create tokenized words
pfz['tokenized_word'] = pfz['review_new'].apply(word_tokenize)
#pfz.head()

In [None]:
print(string.punctuation)

In [None]:
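# Remove special characters (tokens that are a single punctuation mark).
# A set makes `in` an exact-match test; with a plain string it is a substring test.
punc = set(string.punctuation)
pfz['no_punc'] = pfz['tokenized_word'].apply(lambda x: [word for word in x if word not in punc])
#pfz.head()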
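
Testing each token against `string.punctuation` removes single punctuation marks, but multi-character tokens such as `...` pass through. One possible refinement, sketched here on a made-up token list, is to drop any token made up entirely of punctuation characters. The helper name `is_all_punctuation` is hypothetical.

```python
import string

PUNCT_CHARS = set(string.punctuation)


def is_all_punctuation(token):
    """Return True when every character of a non-empty token is punctuation."""
    return bool(token) and all(ch in PUNCT_CHARS for ch in token)


tokens = ["facts", "are", "immutable", ",", "...", "#", "coronavirus", "you're"]
cleaned = [tok for tok in tokens if not is_all_punctuation(tok)]
print(cleaned)
```

Tokens that mix letters and punctuation, such as `you're`, are kept because at least one character is not punctuation.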

In [None]:
# Remove stop words
pfz['no_stopwords'] = pfz['no_punc'].apply(lambda x: [word for word in x if word not in stop_words])
#pfz.head()
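
The instructions list a stemming step (new column8), but no cell in this notebook performs it. A minimal sketch of that step, assuming NLTK's `PorterStemmer` is an acceptable choice of stemmer, is shown on a made-up token list. The column name `stemmed` in the trailing comment is hypothetical.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Made-up token list standing in for one row of the stop-word-free column.
tokens = ["folks", "said", "running", "vaccines"]
stems = [stemmer.stem(tok) for tok in tokens]
print(stems)

# On the dataframe, the same idea would be (column name 'stemmed' is hypothetical):
# pfz['stemmed'] = pfz['no_stopwords'].apply(lambda x: [stemmer.stem(w) for w in x])
```

Stems are not always dictionary words, which is one reason the lemmatization cells that follow start from the unstemmed tokens and their part-of-speech tags.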

In [None]:
# Tag each remaining token with its part of speech (Penn Treebank tags)
pfz['pos_tags'] = pfz['no_stopwords'].apply(nltk.tag.pos_tag)
#pfz.head()

In [None]:
def get_wordnet_pos(tag):
    """Map a Penn Treebank POS tag to the matching WordNet POS constant (noun by default)."""
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN
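
The same mapping can be written as a dictionary lookup on the first letter of the tag. The sketch below uses the literal codes `'a'`, `'v'`, `'n'`, and `'r'`, which match the values shown in the `wordnet_pos` column output further down, so it runs without the WordNet corpus. The names `TAG_PREFIX_TO_WORDNET` and `wordnet_pos_from_tag` are hypothetical.

```python
# WordNet POS codes written as literals so the sketch runs without the
# WordNet corpus; 'a', 'v', 'n', 'r' are the values behind wordnet.ADJ,
# wordnet.VERB, wordnet.NOUN, and wordnet.ADV.
TAG_PREFIX_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}


def wordnet_pos_from_tag(tag):
    """Map a Penn Treebank tag to a WordNet POS code, defaulting to noun."""
    return TAG_PREFIX_TO_WORDNET.get(tag[:1], "n")


tagged = [("folks", "NNS"), ("said", "VBD"), ("wrong", "JJ"), ("hopefully", "RB"), ("the", "DT")]
converted = [(word, wordnet_pos_from_tag(tag)) for word, tag in tagged]
print(converted)
```

Tags with no WordNet counterpart, such as `DT`, fall back to noun, which mirrors the `else` branch of `get_wordnet_pos`.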

In [None]:
# Convert each (word, Penn Treebank tag) pair to (word, WordNet POS)
pfz['wordnet_pos'] = pfz['pos_tags'].apply(lambda x: [(word, get_wordnet_pos(pos_tag)) for (word, pos_tag) in x])
#pfz.head()

In [None]:
# Lemmatize each word using its WordNet POS
wnl = WordNetLemmatizer()
pfz['lemmatized'] = pfz['wordnet_pos'].apply(lambda x: [wnl.lemmatize(word, tag) for word, tag in x])
#pfz.head()

In [None]:
# Add tweet length (characters) and tweet word count, both from the original text
pfz['review_len'] = pfz['text'].astype(str).apply(len)
pfz['word_count'] = pfz['text'].apply(lambda x: len(str(x).split()))
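
Both counts can also be computed with the vectorized `.str` accessor rather than `apply`. The sketch below uses a made-up two-row dataframe. One caveat: on missing values `.str.len()` returns `NaN`, whereas `astype(str)` would count the characters of the string `'nan'`, so the two approaches only agree when `text` has no nulls.

```python
import pandas as pd

# Made-up rows standing in for the tweet text column.
sample = pd.DataFrame({"text": ["Same folks said daikon paste", "Explain to me again"]})

# Character count and whitespace-delimited word count, computed with
# the vectorized .str accessor instead of apply.
sample["review_len"] = sample["text"].str.len()
sample["word_count"] = sample["text"].str.split().str.len()
print(sample)
```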

In [27]:
pfz.head()

Unnamed: 0,user_name,user_location,user_description,user_created,user_followers,user_friends,user_favourites,user_verified,date,text,...,review_new,tokenized_sent,tokenized_word,no_punc,no_stopwords,pos_tags,wordnet_pos,lemmatized,review_len,word_count
0,Rachel Roh,"La Crescenta-Montrose, CA",Aggregator of Asian American news; scanning di...,2009-04-08 17:52:46,405,1692,3247,False,2020-12-20 06:06:44,Same folks said daikon paste could treat a cyt...,...,same folks said daikon paste could treat a cyt...,[same folks said daikon paste could treat a cy...,"[same, folks, said, daikon, paste, could, trea...","[same, folks, said, daikon, paste, could, trea...","[folks, said, daikon, paste, could, treat, cyt...","[(folks, NNS), (said, VBD), (daikon, JJ), (pas...","[(folks, n), (said, v), (daikon, a), (paste, n...","[folk, say, daikon, paste, could, treat, cytok...",97,12
1,Albert Fong,"San Francisco, CA","Marketing dude, tech geek, heavy metal & '80s ...",2009-09-21 15:27:30,834,666,178,False,2020-12-13 16:27:13,While the world has been on the wrong side of ...,...,while the world has been on the wrong side of ...,[while the world has been on the wrong side of...,"[while, the, world, has, been, on, the, wrong,...","[while, the, world, has, been, on, the, wrong,...","[world, wrong, side, history, year, hopefully,...","[(world, NN), (wrong, JJ), (side, NN), (histor...","[(world, n), (wrong, a), (side, n), (history, ...","[world, wrong, side, history, year, hopefully,...",140,21
2,eli🇱🇹🇪🇺👌,Your Bed,"heil, hydra 🖐☺",2020-06-25 23:30:28,10,88,155,False,2020-12-12 20:33:45,#coronavirus #SputnikV #AstraZeneca #PfizerBio...,...,#coronavirus #sputnikv #astrazeneca #pfizerbio...,[#coronavirus #sputnikv #astrazeneca #pfizerbi...,"[#, coronavirus, #, sputnikv, #, astrazeneca, ...","[coronavirus, sputnikv, astrazeneca, pfizerbio...","[coronavirus, sputnikv, astrazeneca, pfizerbio...","[(coronavirus, NN), (sputnikv, NN), (astrazene...","[(coronavirus, n), (sputnikv, n), (astrazeneca...","[coronavirus, sputnikv, astrazeneca, pfizerbio...",140,15
3,Charles Adler,"Vancouver, BC - Canada","Hosting ""CharlesAdlerTonight"" Global News Radi...",2008-09-10 11:28:53,49165,3933,21853,True,2020-12-12 20:23:59,"Facts are immutable, Senator, even when you're...",...,"facts are immutable, senator, even when you ar...","[facts are immutable, senator, even when you a...","[facts, are, immutable, ,, senator, ,, even, w...","[facts, are, immutable, senator, even, when, y...","[facts, immutable, senator, even, ethically, s...","[(facts, NNS), (immutable, JJ), (senator, NN),...","[(facts, n), (immutable, a), (senator, n), (ev...","[fact, immutable, senator, even, ethically, st...",140,20
4,Citizen News Channel,,Citizen News Channel bringing you an alternati...,2020-04-23 17:58:42,152,580,1473,False,2020-12-12 20:17:19,Explain to me again why we need a vaccine @Bor...,...,explain to me again why we need a vaccine @bor...,[explain to me again why we need a vaccine @bo...,"[explain, to, me, again, why, we, need, a, vac...","[explain, to, me, again, why, we, need, a, vac...","[explain, need, vaccine, borisjohnson, matthan...","[(explain, RB), (need, JJ), (vaccine, NN), (bo...","[(explain, r), (need, a), (vaccine, n), (boris...","[explain, need, vaccine, borisjohnson, matthan...",135,14


In [26]:
pfz.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11003 entries, 0 to 11002
Data columns (total 27 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   user_name         11003 non-null  object
 1   user_location     8734 non-null   object
 2   user_description  10323 non-null  object
 3   user_created      11003 non-null  object
 4   user_followers    11003 non-null  int64 
 5   user_friends      11003 non-null  int64 
 6   user_favourites   11003 non-null  int64 
 7   user_verified     11003 non-null  bool  
 8   date              11003 non-null  object
 9   text              11003 non-null  object
 10  hashtags          8426 non-null   object
 11  source            11002 non-null  object
 12  retweets          11003 non-null  int64 
 13  favorites         11003 non-null  int64 
 14  is_retweet        11003 non-null  bool  
 15  lower             11003 non-null  object
 16  remove_ctr        11003 non-null  object
 17  review_new  