<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Instructions" data-toc-modified-id="Instructions-0">Instructions</a></span><ul class="toc-item"><li><span><a href="#Get-data-from-kaggle.com" data-toc-modified-id="Get-data-from-kaggle.com-0.1">Get data from kaggle.com</a></span></li><li><span><a href="#Load-a-dataframe" data-toc-modified-id="Load-a-dataframe-0.2">Load a dataframe</a></span></li><li><span><a href="#Basic-pre-processing" data-toc-modified-id="Basic-pre-processing-0.3">Basic pre-processing</a></span></li></ul></li></ul></div>

## Instructions

1. Load this data set from Kaggle: `kaggle datasets download -d gpreda/pfizer-vaccine-tweets`
2. Determine the shape of the dataframe
3. Review the data types
4. Drop the id column
5. Check for null values
6. Perform the following pre-processing on the 'text' column:
    - (new column1) change all text to lowercase
    - (new column2) use new column1 and expand contractions
    - (new column3) use new column2 and join the words back into a single string
    - (new column4) use new column3 and tokenize into sentences
    - (new column5) use new column3, again, and tokenize into words
    - (new column6) use new column5 and remove special characters
    - (new column7) use new column6 and remove stop words
    - (new column8) use new column7 and perform stemming
    - (new column9) use new column8 and perform lemmatization
    - add columns for tweet length and tweet word count

### Get data from kaggle.com

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
from google.colab import files

## Upload your kaggle json file (API Token)
files.upload()

# -p avoids an error if the directory already exists
!mkdir -p ~/.kaggle

!cp kaggle.json ~/.kaggle/

!chmod 600 ~/.kaggle/kaggle.json

In [None]:
!kaggle datasets download -d gpreda/pfizer-vaccine-tweets

In [None]:
# Alternative: unzip into a local folder instead
# !mkdir data
# !unzip zip file name -d data

# Save it once to your Google Drive
!unzip pfizer-vaccine-tweets.zip -d /content/drive/MyDrive/NLP_data

### Load a dataframe

In [None]:
# What other installs are required for Colab?

!pip install contractions


# Imports
import pandas as pd
import numpy
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import contractions
import string
import re

import nltk

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize


from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))


from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

matplotlib.rcParams['figure.figsize'] = (10.0, 6.0)

from sklearn.feature_extraction.text import CountVectorizer
from collections import Counter
from textblob import TextBlob




In [None]:
pfz = pd.read_csv('/content/drive/MyDrive/NLP_data/vaccination_tweets.csv')
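
Steps 2, 3 and 5 of the instructions (shape, data types, null values) can be checked with three standard pandas calls. The sketch below uses a small toy dataframe named `demo` with made-up columns as a stand-in, so that it runs on its own; in the notebook the same calls would presumably be made on `pfz`.

```python
import pandas as pd

# Toy frame standing in for pfz; these rows and column names are illustrative only.
demo = pd.DataFrame({
    "id": [1, 2, 3],
    "text": ["First tweet", None, "Third tweet"],
    "retweets": [0, 5, 2],
})

n_rows, n_cols = demo.shape          # (number of rows, number of columns)
dtypes = demo.dtypes                 # one dtype per column
null_counts = demo.isnull().sum()    # count of missing values per column

print(n_rows, n_cols)
print(dtypes)
print(null_counts)
```

A null in the 'text' column matters later on, because the string and sentiment steps expect every row to hold a string.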

### Basic pre-processing

In [None]:
# Drop columns
drop_columns = ['id']
pfz = pfz.drop(columns=drop_columns)

In [None]:
# Change text to lowercase
pfz['lower'] = pfz['text'].str.lower()

In [None]:
# Expand contractions word by word (each row becomes a list of words)
pfz['remove_ctr'] = pfz['lower'].apply(lambda x: [contractions.fix(word) for word in x.split()])

In [None]:
# Join the remove_ctr word lists back into a single string
pfz["review_new"] = [' '.join(map(str, l)) for l in pfz['remove_ctr']]

In [None]:
# Create tokenized sentences
pfz['tokenized_sent'] = pfz['review_new'].apply(sent_tokenize)

In [None]:
# Create tokenized words
pfz['tokenized_word'] = pfz['review_new'].apply(word_tokenize)

In [None]:
# Remove special characters (punctuation), using the string module
punc = string.punctuation
pfz['no_punc'] = pfz['tokenized_word'].apply(lambda x: [word for word in x if word not in punc])
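
One detail of the cell above is worth spelling out: `string.punctuation` is a single string, so `word not in punc` is a substring test rather than a per-character membership test. The sketch below compares three possible filters on a made-up token list; the variable names are illustrative and not part of the notebook.

```python
import string

# Made-up tokens for illustration
tokens = ["vaccine", "!", "...", "()", "pfizer", "#"]

punc_str = string.punctuation        # one long string of punctuation characters
punc_set = set(string.punctuation)   # the same characters as a set

# Substring test (what the notebook cell does): drops any token that
# appears somewhere inside punc_str, including multi-character runs like "()".
by_substring = [t for t in tokens if t not in punc_str]

# Set membership: drops only tokens that are a single punctuation character.
by_membership = [t for t in tokens if t not in punc_set]

# Character-level test: drops tokens made up entirely of punctuation.
not_all_punc = [t for t in tokens if not all(ch in punc_set for ch in t)]

print(by_substring)
print(by_membership)
print(not_all_punc)
```

Tokens such as `...` survive the first two filters, so the character-level test may be the better fit if the goal is to drop every punctuation-only token.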

In [None]:
# Remove stop words
pfz['no_stopwords'] = pfz['no_punc'].apply(lambda x: [word for word in x if word not in stop_words])
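
The instructions list a stemming step (new column8), which the cells in this notebook do not cover. A minimal sketch of one way to do it, assuming NLTK's `PorterStemmer` is an acceptable choice; the helper name `stem_tokens` and the column name `stemmed` are hypothetical.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stem_tokens(tokens):
    """Return the Porter stem of each token in a list of words."""
    return [stemmer.stem(tok) for tok in tokens]

# Small made-up example
print(stem_tokens(["running", "vaccines", "tested"]))

# With the notebook's dataframe this could be applied as (hypothetical column name):
# pfz['stemmed'] = pfz['no_stopwords'].apply(stem_tokens)
```

Stems are often not dictionary words, which is presumably why the lemmatization cells in this notebook start from the unstemmed `no_stopwords` tokens rather than from stemmed output.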

In [None]:
# Tag each token with its part of speech (Penn Treebank tags)
pfz['pos_tags'] = pfz['no_stopwords'].apply(nltk.tag.pos_tag)

In [None]:
# Map a Penn Treebank POS tag to the WordNet POS constant the lemmatizer expects
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

In [None]:
# Convert each (word, Penn Treebank tag) pair to a (word, WordNet tag) pair
pfz['wordnet_pos'] = pfz['pos_tags'].apply(lambda x: [(word, get_wordnet_pos(pos_tag)) for (word, pos_tag) in x])

In [None]:
# Lemmatize each token using its WordNet POS tag
wnl = WordNetLemmatizer()
pfz['lemmatized'] = pfz['wordnet_pos'].apply(lambda x: [wnl.lemmatize(word, tag) for word, tag in x])

In [None]:
# Add tweet length (in characters) and tweet word count
pfz['review_len'] = pfz['text'].astype(str).apply(len)
pfz['word_count'] = pfz['text'].apply(lambda x: len(str(x).split()))

In [None]:
# New columns for sentiment polarity and subjectivity (TextBlob)
pfz['polarity'] = pfz['text'].map(lambda text: TextBlob(text).sentiment.polarity)
pfz['subjectivity'] = pfz['text'].map(lambda text: TextBlob(text).sentiment.subjectivity)

In [None]:
# Save the processed dataframe to Google Drive
pfz.to_csv('/content/drive/MyDrive/NLP_data/pfz.csv')