# Johdanto Datatieteeseen (Introduction to Data Science) 2021: practical work
### *author: Ilpo Viertola*
I used [this Jupyter notebook](https://www.kaggle.com/lunamcbride24/covid19-tweet-truth-analysis) as a reference for this work.

In [73]:
# Normal stuff
import os
import pandas as pd
import numpy as np

# For Twitter API
import tweepy
import ast

# Tweet preprocessing
import nltk
nltk.download('stopwords')  # Download stopwords (not downloaded if up to date)
nltk.download('wordnet')    # Download wordnet for lemmatizer (not downloaded if up to date)
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import re
import html
import string

# Keras
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/ilpoviertola/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/ilpoviertola/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


## Reading the data
**Read the data from the [Kaggle CSV files](https://www.kaggle.com/elvinagammed/covid19-fake-news-dataset-nlp) and from the Twitter API.**

In [74]:
csv_path = '/Users/ilpoviertola/OneDrive - TUNI.fi/Kurssimateriaaleja/JODA/datasets/covid19_fake_news'
# Data to train the model with
train_df = pd.read_csv(csv_path + '/Constraint_Train.csv')
train_df.head()
# Data to test the model with
test_df = pd.read_csv(csv_path + '/Constraint_Test.csv')
test_df.head()
# Data to validate the model with
val_df = pd.read_csv(csv_path + '/Constraint_Val.csv')
val_df.head()

Unnamed: 0,id,tweet,label
0,1,Chinese converting to Islam after realising th...,fake
1,2,11 out of 13 people (from the Diamond Princess...,fake
2,3,"COVID-19 Is Caused By A Bacterium, Not Virus A...",fake
3,4,Mike Pence in RNC speech praises Donald Trump’...,fake
4,5,6/10 Sky's @EdConwaySky explains the latest #C...,real


**Next, the Twitter data. First, let's authenticate so we can use the Twitter API.**

In [75]:
# Fetch Twitter API keys from a local file.
with open('twitter.key', 'r') as key_file:
    keys = ast.literal_eval(key_file.read())

auth = tweepy.OAuthHandler(keys['API'], keys['API_secret'])
auth.set_access_token(keys['Access_token'], keys['Access_token_secret'])
api = tweepy.API(auth)

**Create a DataFrame for storing the tweets, then fetch them.**

In [76]:
# Create DataFrame for tweets
tweet_df = pd.DataFrame(columns=['username', 'description', 'location', 'following', 'followers', 'totaltweets', 'retweetcount', 'text', 'hashtags'])

# Get tweets
hashtag = '#covid19'
d_since = '2021-04-07'
limit = 250
tweets = tweepy.Cursor(api.search, q=hashtag, lang='en', since=d_since, tweet_mode='extended').items(limit)
tweets_list = [tweet for tweet in tweets]

**Process the tweets and add them to the DataFrame. Retweets are excluded.**

In [77]:
for tweet in tweets_list:
    # Data about tweets
    username = tweet.user.screen_name
    description = tweet.user.description
    location = tweet.user.location
    following = tweet.user.friends_count
    followers = tweet.user.followers_count
    totaltweets = tweet.user.statuses_count
    retweetcount = tweet.retweet_count
    hashtags = tweet.entities['hashtags']
    
    # Let's ignore all retweets
    if not tweet.retweeted and ('RT @' not in tweet.full_text):

        text = tweet.full_text
        hashtext = [tag['text'] for tag in hashtags]  # Keep only the hashtag text

        # Add the data to the DataFrame.
        ith_tweet = [username, description, location, following, followers, totaltweets, 
                    retweetcount, text, hashtext]
        tweet_df.loc[len(tweet_df)] = ith_tweet

tweet_df.head()

Unnamed: 0,username,description,location,following,followers,totaltweets,retweetcount,text,hashtags
0,johnlittle,Woke Leftie with long business career. Proud t...,Gadigal,5509,6617,22659,0,UK meets its #Covid19 targets vaccinating 32mi...,[Covid19]
1,ziya_al_ansar,"Law Student, Faculty of Law,\nAligarh Muslim U...","Aligarh, Uttar Pradesh.",1780,1836,990,0,Delhi CM Urges Centre To Cancel Board Exam.\nT...,[COVID19]
2,kamleshkhunti,Professor of Primary Care Diabetes & Vascular ...,"University of Leicester, UK",370,10576,15951,0,Great results from PRINCIPLE trial\n\nEarly tr...,[COVID19]
3,threadreaderapp,I'm a 🤖 to help you read threads more easily. ...,Wherever threads are written..,1292,460853,1793412,0,@shreyasbt Hi! please find the unroll here: Wh...,[COVID19]
4,JorgeLiboreiro,Now at @Euronews Brussels bureau. Interests in...,Bruxelles,344,401,8987,0,How the recent landmark ruling by the ECHR lay...,[COVID19]
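
The text-based part of the retweet filter can be tried on plain strings. The sketch below is only an illustration: the helper name `looks_like_retweet` and the sample texts are hypothetical and not part of the notebook. It shows that a substring test also drops original tweets that merely contain `RT @` somewhere, while checking the start of the text is stricter.

```python
def looks_like_retweet(text, strict=False):
    """Return True if the tweet text carries the classic 'RT @user' marker.

    strict=False mirrors the substring test used in the loop above;
    strict=True only matches when the marker starts the text.
    """
    if strict:
        return text.startswith('RT @')
    return 'RT @' in text

# Made-up sample texts
samples = [
    'RT @who: Wash your hands #COVID19',            # classic retweet
    'New #COVID19 figures are out today',           # original tweet
    'Interesting thread, via RT @someone earlier',  # original tweet that mentions a retweet
]

loose_flags = [looks_like_retweet(s) for s in samples]
strict_flags = [looks_like_retweet(s, strict=True) for s in samples]
print(loose_flags)   # [True, False, True]
print(strict_flags)  # [True, False, False]
```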


## Data exploration
**Check column names**

In [78]:
print('Training & validation data\'s columns:')
print(train_df.columns.values)
print(val_df.columns.values)
print('Test data\'s columns:')
print(test_df.columns.values)
print('Twitter data\'s columns:')
print(tweet_df.columns.values)

Training & validation data's columns:
['id' 'tweet' 'label']
['id' 'tweet' 'label']
Test data's columns:
['id' 'tweet']
Twitter data's columns:
['username' 'description' 'location' 'following' 'followers' 'totaltweets'
 'retweetcount' 'text' 'hashtags']


**Columns are OK. Next, check null values and datatypes.**

In [79]:
print('Null values in training data? ' + str(train_df.isnull().values.any()))
print('Null values in testing data? ' + str(test_df.isnull().values.any()))
print('Null values in validation data? ' + str(val_df.isnull().values.any()))
print('Null values in Twitter data? ' + str(tweet_df.isnull().values.any()))

Null values in training data? False
Null values in testing data? False
Null values in validation data? False
Null values in Twitter data? False
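
`isnull().values.any()` only tells whether any value at all is missing. If it ever returned `True`, a per-column count would help locate the problem. A minimal sketch on a made-up frame (`toy_df` is hypothetical and only used for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing values
toy_df = pd.DataFrame({
    'id': [1, 2, 3],
    'tweet': ['first tweet', np.nan, 'third tweet'],
    'label': ['real', 'fake', None],
})

has_nulls = toy_df.isnull().values.any()   # True if any cell is missing
nulls_per_column = toy_df.isnull().sum()   # number of missing cells in each column
print(has_nulls)
print(nulls_per_column)
```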


In [80]:
print('Datatypes for training data: \n' + str(train_df.dtypes) + '\n')
print('Datatypes for validation data: \n' + str(val_df.dtypes) + '\n')
print('Datatypes for testing data: \n' + str(test_df.dtypes) + '\n')
print('Datatypes for Twitter data: \n' + str(tweet_df.dtypes) + '\n')

Datatypes for training data: 
id        int64
tweet    object
label    object
dtype: object

Datatypes for validation data: 
id        int64
tweet    object
label    object
dtype: object

Datatypes for testing data: 
id        int64
tweet    object
dtype: object

Datatypes for Twitter data: 
username        object
description     object
location        object
following       object
followers       object
totaltweets     object
retweetcount    object
text            object
hashtags        object
dtype: object



**The numeric columns of the Twitter dataset are stored as `object`, so they need to be cast to integers.**

In [81]:
tweet_df = tweet_df.astype({'following': 'int32', 'followers': 'int32', 
                            'totaltweets': 'int32', 'retweetcount': 'int32'})
print('New datatypes for Twitter data: \n' + str(tweet_df.dtypes) + '\n')

New datatypes for Twitter data: 
username        object
description     object
location        object
following        int32
followers        int32
totaltweets      int32
retweetcount     int32
text            object
hashtags        object
dtype: object
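
The count columns most likely ended up as `object` because `tweet_df` was created empty and then filled row by row with `.loc`. The sketch below repeats that pattern on a hypothetical two-column frame (`demo_df` is made up) and applies the same `astype` fix; the dtype before the cast may depend on the pandas version.

```python
import pandas as pd

# Hypothetical frame built the same way as tweet_df: empty first, then row by row
demo_df = pd.DataFrame(columns=['username', 'followers'])
demo_df.loc[len(demo_df)] = ['alice', 120]
demo_df.loc[len(demo_df)] = ['bob', 45]

before = demo_df.dtypes['followers']            # dtype before the cast
demo_df = demo_df.astype({'followers': 'int32'})
after = demo_df.dtypes['followers']             # dtype after the cast

print(before, '->', after)
```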



In [82]:
print('\nExample tweet from training data: ')
print(train_df['tweet'][5])
print('\nExample tweet from Twitter data: ')
print(tweet_df['text'][5])


Example tweet from training data: 
Covid Act Now found "on average each person in Illinois with COVID-19 is infecting 1.11 other people. Data shows that the infection growth rate has declined over time this factors in the stay-at-home order and other restrictions put in place." https://t.co/hhigDd24fE

Example tweet from Twitter data: 
Now that the 5KM has lifted. Here are the options Corkonians #Cork #notcork #5km #COVID19 #lockdown #lockdownlifting https://t.co/mFeRbaOxJ4


**Tweets typically contain links, mentions of other users, hashtags and emojis. These must be cleaned before training the model.**
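
To make the cleaning targets concrete, here is a minimal sketch that strips mentions and links from one made-up tweet using only the standard `re` module. The sample text, its link, and the helper name `strip_mentions_and_links` are hypothetical. Hashtags lose only their `#` sign, so the word itself survives.

```python
import re

def strip_mentions_and_links(text):
    text = re.sub(r'@\w+', ' ', text)      # drop @mentions
    text = re.sub(r'http\S+', ' ', text)   # drop links
    text = text.replace('#', '')           # keep the hashtag word, drop the '#'
    return ' '.join(text.split())          # collapse repeated whitespace

# Made-up sample tweet
sample = 'Thanks @EdConwaySky for the #COVID19 update https://t.co/abc123'
cleaned = strip_mentions_and_links(sample)
print(cleaned)  # Thanks for the COVID19 update
```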

In [83]:
print('Training data labels: \n', train_df['label'].value_counts())
print('\nValidation data labels: \n', val_df['label'].value_counts())

Training data labels: 
 real    3360
fake    3060
Name: label, dtype: int64

Validation data labels: 
 real    1120
fake    1020
Name: label, dtype: int64


**The datasets are roughly balanced: they contain approximately as many fake items as real ones. The labels must be binary-coded later.**
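
As a quick check of the balance claim, the printed counts can be turned into class shares. This sketch hard-codes the training counts from the output above; on the real frame, `train_df['label'].value_counts(normalize=True)` should give the same proportions directly.

```python
import pandas as pd

# Label counts copied from the training data output above
train_counts = pd.Series({'real': 3360, 'fake': 3060})
train_share = train_counts / train_counts.sum()   # fraction of each class
print(train_share.round(3))
```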
  
## Tweet preprocessing, a.k.a. feature extraction

In [84]:
puncs = string.punctuation
stopws = stopwords.words('english')
print(puncs)
print(stopws)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only

In [103]:
# Compile the emoji pattern and create the lemmatizer once, not once per tweet.
emoji_pattern = re.compile(pattern = '['
    u'\U0001F600-\U0001F64F'  # emoticons
    u'\U0001F300-\U0001F5FF'  # symbols & pictographs
    u'\U0001F680-\U0001F6FF'  # transport & map symbols
    u'\U0001F1E0-\U0001F1FF'  # flags (iOS)
    u'\U00002702-\U000027B0'  # dingbats
    u'\U000024C2-\U0001F251'  # very broad range: also covers CJK and other non-Latin blocks
    u'\U0001f926-\U0001f937'
    u'\U00010000-\U0010ffff'  # all supplementary planes
    u'\u2640-\u2642'
    u'\u2600-\u2B55'
    u'\u200d'  # zero-width joiner
    u'\u23cf'
    u'\u23e9'
    u'\u231a'
    u'\ufe0f'  # variation selector-16
    u'\u3030'
    ']+', flags = re.UNICODE)
lemmatiser = WordNetLemmatizer()

def tweet_cleaner(tweets):
    # Expects a Series with the default 0..n-1 index; cleans it in place and returns it.
    for i in range(0, len(tweets)):
        tweet = tweets[i]

        tweet = html.unescape(tweet)    # Convert HTML entities (e.g. &amp;) back to characters
        tweet = re.sub(r'@\w+', ' ', tweet) # Remove mentions of other users
        tweet = re.sub(r'http\S+', ' ', tweet)  # Remove links
        tweet = emoji_pattern.sub(r'', tweet)   # Remove emojis

        tweet = ''.join([char for char in tweet if char not in puncs])   # Remove punctuation
        tweet = tweet.lower()   # Lowercase text

        tweetWord = tweet.split()   # Split into words
        tweetWord = [lemmatiser.lemmatize(word, pos='v') for word in tweetWord] # Lemmatize words as verbs

        # Exclude stopwords; each kept word is followed by a space, so the result ends with a space
        tweets[i] = ''.join([word + ' ' for word in tweetWord if word not in stopws])

    return tweets

In [104]:
train_df['clean_tweet'] = tweet_cleaner(train_df['tweet'].copy())
train_df.head()

Unnamed: 0,id,tweet,label,clean_tweet,is_real,tweet_sequence
0,1,The CDC currently reports 99031 deaths. In gen...,real,cdc currently report 99031 deaths general disc...,1,"[93, 199, 6, 8639, 10, 659, 4638, 90, 403, 386..."
1,2,States reported 1121 deaths a small rise from ...,real,state report 1121 deaths small rise last tuesd...,1,"[7, 6, 8641, 10, 645, 132, 47, 1038, 2483, 7, ..."
2,3,Politically Correct Woman (Almost) Uses Pandem...,fake,politically correct woman almost use pandemic ...,0,"[5933, 1150, 387, 471, 37, 23, 2484, 3889, 197..."
3,4,#IndiaFightsCorona: We have 1524 #COVID testin...,real,indiafightscorona 1524 covid test laboratories...,1,"[19, 8642, 14, 3, 210, 16, 3009, 213, 42, 5934..."
4,5,Populous states can generate large case counts...,real,populous state generate large case count look ...,1,"[5935, 7, 2110, 440, 2, 403, 196, 5, 2, 100, 5..."


In [105]:
test_df['clean_tweet'] = tweet_cleaner(test_df['tweet'].copy())
test_df.head()

Unnamed: 0,id,tweet,clean_tweet,tweet_sequence
0,1,Our daily update is published. States reported...,daily update publish state report 734k test 39...,"[44, 18, 94, 7, 6, 16494, 3, 3658, 5, 2, 5349,..."
1,2,Alfalfa is the only cure for COVID-19.,alfalfa cure covid19,"[16495, 79, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0..."
2,3,President Trump Asked What He Would Do If He W...,president trump ask would catch coronavirus do...,"[84, 45, 322, 159, 623, 4, 304, 4, 0, 0, 0, 0,..."
3,4,States reported 630 deaths. We are still seein...,state report 630 deaths still see solid nation...,"[7, 6, 16496, 10, 124, 27, 2127, 145, 426, 90,..."
4,5,This is the sixth time a global health emergen...,sixth time global health emergency declare int...,"[2948, 33, 334, 15, 377, 913, 542, 15, 2387, 1..."


In [106]:
val_df['clean_tweet'] = tweet_cleaner(val_df['tweet'].copy())
val_df.head()

Unnamed: 0,id,tweet,label,clean_tweet,is_real,tweet_sequence
0,1,Chinese converting to Islam after realising th...,fake,chinese convert islam realise muslim affect co...,0,"[187, 1818, 2657, 3127, 701, 284, 4, 6875, 92,..."
1,2,11 out of 13 people (from the Diamond Princess...,fake,11 13 people diamond princess cruise ship inti...,0,"[449, 504, 8, 4023, 3508, 1323, 817, 14630, 3,..."
2,3,"COVID-19 Is Caused By A Bacterium, Not Virus A...",fake,covid19 cause bacterium virus treat aspirin,0,"[1, 117, 2372, 24, 180, 2350, 0, 0, 0, 0, 0, 0..."
3,4,Mike Pence in RNC speech praises Donald Trump’...,fake,mike pence rnc speech praise donald trump’s co...,0,"[1892, 1722, 6305, 2840, 1907, 197, 1352, 1, 5..."
4,5,6/10 Sky's @EdConwaySky explains the latest #C...,real,610 sky explain latest covid19 data government...,1,"[14632, 754, 492, 105, 1, 32, 78, 2246, 29, 4,..."


In [107]:
tweet_df['clean_tweet'] = tweet_cleaner(tweet_df['text'].copy())
tweet_df.head()

Unnamed: 0,username,description,location,following,followers,totaltweets,retweetcount,text,hashtags,clean_tweet,tweet_sequence
0,johnlittle,Woke Leftie with long business career. Proud t...,Gadigal,5509,6617,22659,0,UK meets its #Covid19 targets vaccinating 32mi...,[Covid19],uk meet covid19 target vaccinate 32million top...,"[173, 474, 1, 1278, 1129, 18455, 647, 1798, 20..."
1,ziya_al_ansar,"Law Student, Faculty of Law,\nAligarh Muslim U...","Aligarh, Uttar Pradesh.",1780,1836,990,0,Delhi CM Urges Centre To Cancel Board Exam.\nT...,[COVID19],delhi cm urge centre cancel board exam today u...,"[470, 1509, 609, 369, 1103, 1675, 3170, 17, 60..."
2,kamleshkhunti,Professor of Primary Care Diabetes & Vascular ...,"University of Leicester, UK",370,10576,15951,0,Great results from PRINCIPLE trial\n\nEarly tr...,[COVID19],great result principle trial early treatment r...,"[808, 135, 8628, 545, 349, 205, 4775, 37, 1403..."
3,threadreaderapp,I'm a 🤖 to help you read threads more easily. ...,Wherever threads are written..,1292,460853,1793412,0,@shreyasbt Hi! please find the unroll here: Wh...,[COVID19],hi please find unroll know far covid19 1 fresh...,"[1730, 191, 64, 18459, 108, 275, 1, 38, 1844, ..."
4,JorgeLiboreiro,Now at @Euronews Brussels bureau. Interests in...,Bruxelles,344,401,8987,0,How the recent landmark ruling by the ECHR lay...,[COVID19],recent landmark rule echr lay grind mandatory ...,"[367, 1923, 325, 18462, 1847, 1349, 1081, 1, 8..."


**Remove rows whose clean_tweet is blank. (This can happen if the tweet contained only a link, for example.)**

In [108]:
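# Assign the result back instead of using inplace=True on a column selection,
# which is not guaranteed to modify the parent DataFrame in newer pandas versions.
train_df['clean_tweet'] = train_df['clean_tweet'].replace('', np.nan)
test_df['clean_tweet'] = test_df['clean_tweet'].replace('', np.nan)
val_df['clean_tweet'] = val_df['clean_tweet'].replace('', np.nan)
tweet_df['clean_tweet'] = tweet_df['clean_tweet'].replace('', np.nan)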

train_df.dropna(subset=['clean_tweet'], inplace=True)
test_df.dropna(subset=['clean_tweet'], inplace=True)
val_df.dropna(subset=['clean_tweet'], inplace=True)
tweet_df.dropna(subset=['clean_tweet'], inplace=True)
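
The replace-then-drop pattern can be checked on a tiny made-up frame. In this sketch (`toy_df` and its contents are hypothetical), the first row cleaned down to an empty string because it held only a link, so it is removed. Only strings that are exactly empty are affected.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the first tweet was only a link, so nothing was left after cleaning
toy_df = pd.DataFrame({
    'text': ['https://t.co/xyz', 'Stay safe everyone'],
    'clean_tweet': ['', 'stay safe everyone '],
})

toy_df['clean_tweet'] = toy_df['clean_tweet'].replace('', np.nan)  # mark blanks as missing
toy_df = toy_df.dropna(subset=['clean_tweet'])                     # drop the marked rows
print(len(toy_df))  # 1
```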

**Compare a "dirty" tweet with its "clean" version**

In [109]:
print('Some dirty tweet:\n', train_df['tweet'][150])
print('\nClean version:\n', train_df['clean_tweet'][150])

Some dirty tweet:
 Thirty-nine GPs and specialists have written to the BMJ calling for action on long COVID. https://t.co/4Y5kGv3pF3 https://t.co/jTc1OucOmw

Clean version:
 thirtynine gps specialists write bmj call action long covid 


**Binary-code the values of the label column into a new is_real column in train_df and val_df. 0 = fake, 1 = real**

In [110]:
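# get_dummies may return booleans in newer pandas versions, so cast to int explicitly.
train_df['is_real'] = pd.get_dummies(train_df['label'])['real'].astype(int)
val_df['is_real'] = pd.get_dummies(val_df['label'])['real'].astype(int)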
val_df.head()

Unnamed: 0,id,tweet,label,clean_tweet,is_real,tweet_sequence
0,1,Chinese converting to Islam after realising th...,fake,chinese convert islam realise muslim affect co...,0,"[187, 1818, 2657, 3127, 701, 284, 4, 6875, 92,..."
1,2,11 out of 13 people (from the Diamond Princess...,fake,11 13 people diamond princess cruise ship inti...,0,"[449, 504, 8, 4023, 3508, 1323, 817, 14630, 3,..."
2,3,"COVID-19 Is Caused By A Bacterium, Not Virus A...",fake,covid19 cause bacterium virus treat aspirin,0,"[1, 117, 2372, 24, 180, 2350, 0, 0, 0, 0, 0, 0..."
3,4,Mike Pence in RNC speech praises Donald Trump’...,fake,mike pence rnc speech praise donald trump’s co...,0,"[1892, 1722, 6305, 2840, 1907, 197, 1352, 1, 5..."
4,5,6/10 Sky's @EdConwaySky explains the latest #C...,real,610 sky explain latest covid19 data government...,1,"[14632, 754, 492, 105, 1, 32, 78, 2246, 29, 4,..."
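
For a two-class label, the same 0/1 coding can also be produced with a plain comparison. This is a sketch of an alternative, not what the notebook runs; the `labels` series is made up for illustration.

```python
import pandas as pd

# Hypothetical labels in the same format as the label column
labels = pd.Series(['fake', 'real', 'real', 'fake'], name='label')

# True where the label is 'real', then cast True/False to 1/0
is_real = (labels == 'real').astype(int)
print(is_real.tolist())  # [0, 1, 1, 0]
```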


**Export dataframes to csv-files**

In [111]:
# Create the output directory if it does not exist yet.
os.makedirs(csv_path + '/modified_datasets', exist_ok=True)

train_df.to_csv(csv_path + '/modified_datasets/train.csv')
test_df.to_csv(csv_path + '/modified_datasets/test.csv')
val_df.to_csv(csv_path + '/modified_datasets/val.csv')
tweet_df.to_csv(csv_path + '/modified_datasets/tweets.csv')

### Tokenization and padding

In [112]:
train_clean = train_df['clean_tweet'].copy()
test_clean = test_df['clean_tweet'].copy()
val_clean = val_df['clean_tweet'].copy()
tweet_clean = tweet_df['clean_tweet'].copy()

# Series.append was removed in pandas 2.0; pd.concat gives the same result.
all_clean = pd.concat([train_clean, val_clean, test_clean, tweet_clean], ignore_index=True)
print('Amount of tweets:', len(all_clean))

Amount of tweets: 10754


In [113]:
# The Tokenizer assigns an integer to every distinct word in the all_clean dataset.
# This is needed because neural networks operate on numbers, not words, so
# each word must be mapped to an integer.
tokenizer = Tokenizer()
tokenizer.fit_on_texts(all_clean)

In [114]:
# This function returns an array of integer sequences, one per tweet. Each integer maps to a certain word,
# so every row is a clean_tweet encoded as integers.
# Every sequence must be the same length, so shorter sequences are padded with 0s.
# Note: without maxlen, each call pads to the longest sequence in that call,
# so the arrays for different datasets may differ in width.
def tokenize_tweet(tweets):
    texts = tokenizer.texts_to_sequences(tweets) # Convert each clean_tweet to a sequence of integers.
    texts = pad_sequences(texts, padding='post') # Pad the sequences with trailing 0s so the lengths match.
    return texts
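
The idea behind the two Keras calls can be sketched in plain Python. This is a simplified illustration, not the Keras implementation: the helper names `build_word_index` and `encode_and_pad` and the toy texts are hypothetical, and the real `Tokenizer` also lowercases text and filters punctuation by default.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer; more frequent words get smaller indices.
    Index 0 is left unused so it can serve as the padding value."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def encode_and_pad(texts, word_index):
    """Encode texts as integer lists and right-pad them with 0s to equal length."""
    seqs = [[word_index[w] for w in text.split() if w in word_index] for text in texts]
    max_len = max(len(s) for s in seqs)
    return [s + [0] * (max_len - len(s)) for s in seqs]

# Made-up, already cleaned texts
toy_texts = ['covid19 case rise', 'covid19 vaccine', 'case count']
toy_index = build_word_index(toy_texts)
toy_seqs = encode_and_pad(toy_texts, toy_index)
print(toy_index)  # {'covid19': 1, 'case': 2, 'rise': 3, 'vaccine': 4, 'count': 5}
print(toy_seqs)   # [[1, 2, 3], [1, 4, 0], [2, 5, 0]]
```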

In [115]:
texts_train = tokenize_tweet(train_df["clean_tweet"].copy()) # Collect the tweet sequences
train_df["tweet_sequence"] = list(texts_train)
train_df.head()

Unnamed: 0,id,tweet,label,clean_tweet,is_real,tweet_sequence
0,1,The CDC currently reports 99031 deaths. In gen...,real,cdc currently report 99031 deaths general disc...,1,"[93, 199, 6, 8631, 10, 645, 4637, 90, 403, 386..."
1,2,States reported 1121 deaths a small rise from ...,real,state report 1121 deaths small rise last tuesd...,1,"[7, 6, 8633, 10, 646, 132, 47, 1038, 2482, 7, ..."
2,3,Politically Correct Woman (Almost) Uses Pandem...,fake,politically correct woman almost use pandemic ...,0,"[5926, 1152, 387, 471, 37, 23, 2483, 3888, 197..."
3,4,#IndiaFightsCorona: We have 1524 #COVID testin...,real,indiafightscorona 1524 covid test laboratories...,1,"[19, 8634, 14, 3, 210, 16, 3008, 213, 42, 5927..."
4,5,Populous states can generate large case counts...,real,populous state generate large case count look ...,1,"[5928, 7, 2110, 440, 2, 403, 196, 5, 2, 100, 5..."


In [116]:
texts_test = tokenize_tweet(test_df["clean_tweet"].copy())
test_df["tweet_sequence"] = list(texts_test)
test_df.head()

Unnamed: 0,id,tweet,clean_tweet,tweet_sequence
0,1,Our daily update is published. States reported...,daily update publish state report 734k test 39...,"[44, 18, 94, 7, 6, 16453, 3, 3657, 5, 2, 5344,..."
1,2,Alfalfa is the only cure for COVID-19.,alfalfa cure covid19,"[16454, 79, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0..."
2,3,President Trump Asked What He Would Do If He W...,president trump ask would catch coronavirus do...,"[84, 45, 322, 159, 623, 4, 304, 4, 0, 0, 0, 0,..."
3,4,States reported 630 deaths. We are still seein...,state report 630 deaths still see solid nation...,"[7, 6, 16455, 10, 125, 27, 2127, 145, 426, 90,..."
4,5,This is the sixth time a global health emergen...,sixth time global health emergency declare int...,"[2947, 33, 334, 15, 377, 913, 538, 15, 2386, 1..."


In [117]:
texts_val = tokenize_tweet(val_df["clean_tweet"].copy())
val_df["tweet_sequence"] = list(texts_val)
val_df.head()

Unnamed: 0,id,tweet,label,clean_tweet,is_real,tweet_sequence
0,1,Chinese converting to Islam after realising th...,fake,chinese convert islam realise muslim affect co...,0,"[187, 1818, 2657, 3126, 701, 284, 4, 6868, 92,..."
1,2,11 out of 13 people (from the Diamond Princess...,fake,11 13 people diamond princess cruise ship inti...,0,"[449, 504, 8, 4022, 3508, 1323, 817, 14595, 3,..."
2,3,"COVID-19 Is Caused By A Bacterium, Not Virus A...",fake,covid19 cause bacterium virus treat aspirin,0,"[1, 117, 2371, 24, 180, 2350, 0, 0, 0, 0, 0, 0..."
3,4,Mike Pence in RNC speech praises Donald Trump’...,fake,mike pence rnc speech praise donald trump’s co...,0,"[1893, 1722, 6298, 2839, 1908, 197, 1352, 1, 5..."
4,5,6/10 Sky's @EdConwaySky explains the latest #C...,real,610 sky explain latest covid19 data government...,1,"[14597, 754, 492, 105, 1, 32, 78, 2246, 29, 4,..."


In [118]:
texts_tweets = tokenize_tweet(tweet_df["clean_tweet"].copy())
tweet_df["tweet_sequence"] = list(texts_tweets)
tweet_df.head()

Unnamed: 0,username,description,location,following,followers,totaltweets,retweetcount,text,hashtags,clean_tweet,tweet_sequence
0,johnlittle,Woke Leftie with long business career. Proud t...,Gadigal,5509,6617,22659,0,UK meets its #Covid19 targets vaccinating 32mi...,[Covid19],uk meet covid19 target vaccinate 32million top...,"[173, 474, 1, 1278, 1129, 18405, 648, 1798, 20..."
1,ziya_al_ansar,"Law Student, Faculty of Law,\nAligarh Muslim U...","Aligarh, Uttar Pradesh.",1780,1836,990,0,Delhi CM Urges Centre To Cancel Board Exam.\nT...,[COVID19],delhi cm urge centre cancel board exam today u...,"[470, 1509, 609, 369, 1103, 1675, 3170, 17, 60..."
2,kamleshkhunti,Professor of Primary Care Diabetes & Vascular ...,"University of Leicester, UK",370,10576,15951,0,Great results from PRINCIPLE trial\n\nEarly tr...,[COVID19],great result principle trial early treatment r...,"[808, 135, 8620, 545, 349, 205, 4774, 37, 1403..."
3,threadreaderapp,I'm a 🤖 to help you read threads more easily. ...,Wherever threads are written..,1292,460853,1793412,0,@shreyasbt Hi! please find the unroll here: Wh...,[COVID19],hi please find unroll know far covid19 1 fresh...,"[1730, 191, 64, 18409, 108, 275, 1, 38, 1844, ..."
4,JorgeLiboreiro,Now at @Euronews Brussels bureau. Interests in...,Bruxelles,344,401,8987,0,How the recent landmark ruling by the ECHR lay...,[COVID19],recent landmark rule echr lay grind mandatory ...,"[367, 1924, 327, 18411, 1847, 1349, 1081, 1, 8..."


In [119]:
print('Some clean tweet:', train_df['clean_tweet'][300])
print('\nSame tweet\'s encoded sequence:', train_df['tweet_sequence'][300])
print('\nSequence transformed back to normal text:', tokenizer.sequences_to_texts([train_df['tweet_sequence'][300]]))

Some clean tweet: new numerous covid19 outbreaks recent cruise ship voyage extend previous sail order prevent spread covid19 among crew onboard 

Same tweet's encoded sequence: [   5 6082    1  642  367 1323  817 8920  894  748 4770  273  140   31
    1  365 2537 6083    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0  

## Describing the data
In this section, we will take a look inside the datasets we're going to use in this work.

In [120]:
print('The total number of words: ', len(tokenizer.word_index) + 1)
print(tokenizer.word_index)
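
The `+ 1` in the word count accounts for index 0, which the word index never assigns to a word because it is used for padding. A tiny sketch with a made-up word index (`toy_word_index` is hypothetical):

```python
# Hypothetical word index in the same shape as tokenizer.word_index (indices start at 1)
toy_word_index = {'covid19': 1, 'case': 2, 'rise': 3}

# One extra slot for the padding value 0
vocab_size = len(toy_word_index) + 1

# Every word index, and the padding index 0, is a valid position below vocab_size
all_indices = [0] + list(toy_word_index.values())
print(vocab_size)  # 4
```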

