# FINAL DATASET CREATION

To create the final dataset, I took a step-by-step, iterative approach. I worked on one persona at a time to understand its unique way of speaking, then went back over the personas already processed and re-cleaned them in the same way (e.g. I started with Biden, moved on to Kim Kardashian, and then went back to re-clean Biden). 

This approach helped me understand the complexity of cleaning Twitter data: each individual communicates in a very specific way, uses particular slang, and has quirks that must be taken into account when applying the same cleaning across datasets. 

For each persona, these are the overall steps I followed for cleaning and feature engineering: 

* dropping duplicated Tweets with same URL (tweet_id)
* counting the number of words
* counting the total number of characters
* finding/counting/removing all the hashtags used
* finding/counting/removing all the @mentions used
* finding/counting/removing all the emojis used
* dropping all tweets considered retweets (retweets were identified by RT followed by an @mention) **
* removing website links (for example, anything starting with 'http')
* removing punctuation
* counting upper case words 
* setting the cleaned text to lower case 

In the end, I have text that is free from retweets, hashtags, emojis and @mentions, and ready to be manipulated with NLP techniques. 


** This is because Biden and Trump, for example, address calls to action to their audience such as "RT this post if you support me", which I decided to keep in the analysis as it is a way of doing politics. I therefore did not drop tweets that merely contain RT, only those that clearly indicate the persona was retweeting someone else's tweet (RT followed by an @mention). 
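
The retweet rule described above can be illustrated on a few strings. This is a minimal sketch using only the standard library; the sample tweets and the helper name `is_retweet` are invented for illustration, while the pattern is the one applied in the cleaning cells.

```python
import re

# "RT" followed by an @mention and a colon marks a retweet of someone else
RETWEET_PATTERN = re.compile(r'RT @(\w+):')

def is_retweet(text):
    # True only when the RT is attached to an @mention
    return bool(RETWEET_PATTERN.findall(text))

print(is_retweet("RT @KamalaHarris: We did it."))    # True
print(is_retweet("RT this post if you support me"))  # False
print(is_retweet("RT if you stand with @JoeBiden"))  # False
```

A bare "RT" used as a call to action is kept, which matches the decision to treat it as part of the persona's own voice.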


In [1]:
import pandas as pd
import regex as re
import emoji
import string
from collections import Counter

## Biden

In [2]:
# loading dataset
bid = pd.read_csv('biden_with_likes.csv')
print('Biden initial dataset size:', len(bid))

Biden initial dataset size: 6185


In [3]:
# drop duplicates with URL
bid = bid.drop_duplicates(subset = 'tweet_id')
print('Biden dataset size without URL duplicates:', len(bid))

Biden dataset size without URL duplicates: 6065


In [4]:
# counting the number of words used in a tweet
bid['total_words'] = bid['tweet'].str.split().str.len()

# counting the number of characters used in a tweet
bid['total_charact'] = bid['tweet'].apply(lambda x: len(x))

# creating a column with all the hashtags used
bid['hashtag'] = bid['tweet'].apply(lambda x: re.findall(r'\B#\w*[a-zA-Z]+\w*', x)) 

# creating a column with the number of hashtags used
bid['num_hashtags'] = bid['hashtag'].apply(lambda x: len(x))

# creating a column with @mentions only 
bid['mentions'] = bid['tweet'].apply(lambda x: re.findall(r'@[A-Za-z0-9]+', x))

# counting the @mentions only: 
bid['num_mentions'] = bid['mentions'].apply(lambda x: len(x))
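
To see what the hashtag and mention patterns capture, the sketch below runs them on a single invented tweet with the standard `re` module (the handle `@some_user` is hypothetical). It suggests two behaviours worth keeping in mind: a purely numeric tag such as `#2020` is not counted as a hashtag, and a handle containing an underscore is only captured up to the underscore.

```python
import re

HASHTAG_PATTERN = r'\B#\w*[a-zA-Z]+\w*'  # a hashtag must contain at least one letter
MENTION_PATTERN = r'@[A-Za-z0-9]+'       # letters and digits only, no underscore

tweet = "Thank you @JoeBiden and @some_user! #Vote2020 #2020"

features = {
    'total_words': len(tweet.split()),
    'hashtag': re.findall(HASHTAG_PATTERN, tweet),
    'mentions': re.findall(MENTION_PATTERN, tweet),
}

print(features['total_words'])  # 7
print(features['hashtag'])      # ['#Vote2020']
print(features['mentions'])     # ['@JoeBiden', '@some']
```

If handles with underscores matter for the analysis, a pattern such as `@\w+` would capture them whole, at the cost of changing the extracted mentions.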

In [5]:
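# function to extract all emojis 
# note: emoji.UNICODE_EMOJI is only available in older releases of the emoji package;
# from version 2.0 onwards the equivalent lookup table is emoji.EMOJI_DATA
def extract_emojis(s):
    return ''.join(c for c in s if c in emoji.UNICODE_EMOJI )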

In [6]:
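# function to count the emojis 

def split_count(text):
    total_emoji = []  # list of all emojis found in the text
    # \X matches one extended grapheme cluster, so multi-codepoint emojis
    # (e.g. flags) stay together; this needs the `regex` module imported as re
    data = re.findall(r'\X', text)
    for word in data:
        if any(char in emoji.UNICODE_EMOJI for char in word):
            total_emoji.append(word)
    return Counter(total_emoji)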

In [7]:
# extracting the emojis, counting them and making an overall count

bid['emojis'] = bid['tweet'].apply(lambda x: extract_emojis(x))
bid['counter_emojis'] = bid['tweet'].apply(lambda x: split_count(x))
bid['num_emojis'] = bid.counter_emojis.apply(lambda x : sum(x.values()))

In [8]:
bid[bid.num_emojis >1][['emojis', 'counter_emojis', 'num_emojis']]

| | emojis | counter_emojis | num_emojis |
|---|---|---|---|
| 999 | 🇺🇸🇺🇸🇺🇸 | {'🇺🇸': 3} | 3 |
| 1027 | 🇺🇸🇺🇸🇺🇸 | {'🇺🇸': 3} | 3 |
| 1634 | ✅✅✅ | {'✅': 3} | 3 |
| 1657 | ✅✅✅ | {'✅': 3} | 3 |
| 2209 | 🔴🔴🔴 | {'🔴': 3} | 3 |
| 2226 | ✅✅✅ | {'✅': 3} | 3 |
| 2268 | ✅✅✅ | {'✅': 3} | 3 |
| 2562 | ✅✅✅ | {'✅': 3} | 3 |


#### CLEANING 

In [9]:
# creating a column that flags retweets as "RT @user:" rather than a bare "RT",
# because a user might ask the audience to retweet a tweet (e.g. "RT this if you..")

bid['RT'] = bid['tweet'].apply(lambda x: re.findall(r'RT @(\w+):', x))

#bid['RT'] = bid['tweet'].apply(lambda x: re.findall(r'^(RT)( @\w*)?[: ]', x))
#bid['RT'] = bid['tweet'].apply(lambda x: re.findall(r'(?<!RT\s)@\S+', x))


In [10]:
# checking the number of Retweets
print('total Retweets:',len(bid[(bid['RT'].str.len() > 0)]))

total Retweets: 27


In [11]:
# dropping what appear to clearly be Retweets (RT + @mentions)
bid = bid[(bid['RT'].str.len() == 0)]
print('Biden dataset size without retweets:', len(bid))

Biden dataset size without retweets: 6038


In [12]:
# cleaning tweets from hashtags
bid['cleaned'] = bid['tweet'].apply(lambda x: ' '.join(re.sub(r'\B#\w*[a-zA-Z]+\w*'," ", x).split()))

# cleaning tweets from @mentions
bid['cleaned'] = bid['cleaned'].apply(lambda x: ' '.join(re.sub(r'@[A-Za-z0-9]+'," ", x).split()))

# cleaning tweets from websites 
bid['cleaned'] = bid['cleaned'].apply(lambda x: ' '.join(re.sub(r'\b(?:https?://)?(?:(?i:[a-z]+\.)+)[^\s,]+\b'," ", x).split()))
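
The website pattern is the least obvious of the three, so here is a small sketch of its behaviour on invented strings, using the standard `re` module (the helper name `strip_urls` is hypothetical). Because the `https://` prefix is optional, the pattern also removes bare domains, and it appears to catch dotted abbreviations as well, which may be worth checking on real tweets.

```python
import re

# same pattern as in the cell above
URL_PATTERN = re.compile(r'\b(?:https?://)?(?:(?i:[a-z]+\.)+)[^\s,]+\b')

def strip_urls(text):
    # replace each match with a space, then collapse repeated whitespace
    return ' '.join(URL_PATTERN.sub(' ', text).split())

print(strip_urls("Read more at https://t.co/abc123 today"))  # Read more at today
print(strip_urls("Visit example.com/vote now"))              # Visit now
print(strip_urls("the U.S. economy"))                        # the . economy
```

In the last case "U.S" is treated like a domain and removed, so abbreviations written with dots do not survive this step.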

In [13]:
def remove_emoji(text):
    # character class of emoji-related ranges; the broad ranges below also strip
    # non-emoji symbols and every character above U+FFFF
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags (regional indicators)
                               u"\U00002500-\U00002BEF"  # box drawing, shapes, misc symbols, arrows
                               u"\U00002702-\U000027B0"  # dingbats
                               u"\U000024C2-\U0001F251"  # very broad: U+24C2 to U+1F251, includes CJK text
                               u"\U0001f926-\U0001f937"
                               u"\U00010000-\U0010ffff"  # all supplementary-plane characters
                               u"\u2640-\u2642"
                               u"\u2600-\u2B55"
                               u"\u200d"  # zero-width joiner
                               u"\u23cf"
                               u"\u23e9"
                               u"\u231a"
                               u"\ufe0f"  # variation selector-16
                               u"\u3030"
                               "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

# removing emojis 
bid['cleaned'] = bid['cleaned'].apply(lambda x: remove_emoji(x))

In [14]:
# removing punctuation
punctuations = '!"“”#$%&\'’‘()*+—-.–,–/:;<=>?@[\\]^_`{|}~™®©¹⁉'
bid['cleaned'] = bid['cleaned'].apply(lambda x: ''.join([i for i in x if not i in punctuations]))

In [15]:
# counting the number of upper-case words in the cleaned text
# "I", "U" and "R" are excluded, as they would otherwise be counted as upper-case words
bid['num_upper_words'] = bid['cleaned'].apply(lambda x: sum([y.isupper() if y not in ['I', 'U', 'R'] else False for y in x.split()]))
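
The upper-case count can be checked on a few invented sentences. The sketch below is an equivalent, stdlib-only restatement of the expression above (the helper name `count_upper_words` is hypothetical); it also shows that a sentence-initial "A" is still counted, since only "I", "U" and "R" are excluded.

```python
IGNORED = {'I', 'U', 'R'}  # single letters excluded from the count

def count_upper_words(text):
    # a word counts when every cased character in it is upper case
    return sum(1 for word in text.split() if word.isupper() and word not in IGNORED)

print(count_upper_words("I think this is HUGE and VERY important"))  # 2
print(count_upper_words("R U ready"))                                # 0
print(count_upper_words("A GREAT day"))                              # 2
```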

In [16]:
# converting everything to lower case
bid['final'] = bid['cleaned'].str.lower()

In [17]:
# keeping only relevant columns
bid = bid[['date', 
           'likes',
           'retweets',
           'tweet',
           'tweet_id',
           'total_words',
           'total_charact', 
           'hashtag', 
           'num_hashtags', 
           'mentions', 
           'num_mentions', 
           'emojis', 
           'counter_emojis',
           'num_emojis',
           'num_upper_words',
           'cleaned',
           'final'
            ]]

## Kim Kardashian

In [18]:
# loading the dataset
kim = pd.read_csv('kim_with_likes.csv')
print('Kim initial dataset size:', len(kim))

Kim initial dataset size: 29696


In [19]:
# drop duplicates with URL
kim = kim.drop_duplicates(subset = 'tweet_id')
print('Kim dataset size without URL duplicates:', len(kim))

Kim dataset size without URL duplicates: 29115


In [20]:
# counting the number of words used in a tweet
kim['total_words'] = kim['tweet'].str.split().str.len()

# counting the number of characters used in a tweet
kim['total_charact'] = kim['tweet'].apply(lambda x: len(x))

# creating a column with all the hashtags used
kim['hashtag'] = kim['tweet'].apply(lambda x: re.findall(r'\B#\w*[a-zA-Z]+\w*', x)) 

# creating a column with the number of hashtags used
kim['num_hashtags'] = kim['hashtag'].apply(lambda x: len(x))

# creating a column with @mentions only 
kim['mentions'] = kim['tweet'].apply(lambda x: re.findall(r'@[A-Za-z0-9]+', x))

# counting the @mentions only: 
kim['num_mentions'] = kim['mentions'].apply(lambda x: len(x))

In [21]:
# extracting the emojis, counting them and making an overall count
kim['emojis'] = kim['tweet'].apply(lambda x: extract_emojis(x))
kim['counter_emojis'] = kim['tweet'].apply(lambda x: split_count(x))
kim['num_emojis'] = kim.counter_emojis.apply(lambda x : sum(x.values()))

#### CLEANING

In [22]:
# creating a column that flags retweets as "RT @user:" rather than a bare "RT",
# because a user might ask the audience to retweet a tweet (e.g. "RT this if you..")

kim['RT'] = kim['tweet'].apply(lambda x: re.findall(r'RT @(\w+):', x))

In [23]:
# checking how many retweets Kim did
print('total Retweets:',len(kim[(kim['RT'].str.len() > 0)]))

total Retweets: 1834


In [24]:
# dropping what appear to clearly be Retweets (RT + @mentions)
kim = kim[(kim['RT'].str.len() == 0)]
print('Kim dataset size without retweets:', len(kim))

Kim dataset size without retweets: 27281


In [25]:
# cleaning tweets from hashtags
kim['cleaned'] = kim['tweet'].apply(lambda x: ' '.join(re.sub(r'\B#\w*[a-zA-Z]+\w*'," ", x).split()))

# cleaning tweets from @mentions
#kim['cleaned'] = kim['cleaned'].apply(lambda x: ' '.join(re.sub(r'@(\w+):'," ", x).split()))
kim['cleaned'] = kim['cleaned'].apply(lambda x: ' '.join(re.sub(r'@[A-Za-z0-9]+'," ", x).split()))

# cleaning tweets from websites 
kim['cleaned'] = kim['cleaned'].apply(lambda x: ' '.join(re.sub(r'\b(?:https?://)?(?:(?i:[a-z]+\.)+)[^\s,]+\b'," ", x).split()))

# removing emojis 
kim['cleaned'] = kim['cleaned'].apply(lambda x: remove_emoji(x))

# removing punctuation
kim['cleaned'] = kim['cleaned'].apply(lambda x: ''.join([i for i in x if not i in punctuations]))

In [26]:
# counting the number of upper-case words in the cleaned text
# "I", "U" and "R" are excluded, as they would otherwise be counted as upper-case words
kim['num_upper_words'] = kim['cleaned'].apply(lambda x: sum([y.isupper() if y not in ['I', 'U', 'R'] else False for y in x.split()]))

In [27]:
# converting everything to lower case
kim['final'] = kim['cleaned'].str.lower()

In [28]:
# keeping only relevant columns
kim = kim[['date', 
           'likes',
           'retweets',
           'tweet',
           'tweet_id',
           'total_words',
           'total_charact', 
           'hashtag', 
           'num_hashtags', 
           'mentions', 
           'num_mentions', 
           'emojis', 
           'counter_emojis',
           'num_emojis',
           'num_upper_words',
           'cleaned',
           'final'
            ]]

## Trump

In [29]:
# loading dataset
trump = pd.read_csv('trump_with_likes.csv')
print('Trump initial dataset size:', len(trump))

Trump initial dataset size: 55090


In [30]:
# renaming columns to match the other datasets
trump = trump.rename(columns={'id': 'tweet_id', 'text': 'tweet', 'favorites': 'likes'})

In [31]:
# drop duplicates with URL
trump = trump.drop_duplicates(subset = 'tweet_id')
print('Trump dataset size without URL duplicates:', len(trump))

Trump dataset size without URL duplicates: 55090


In [32]:
# deleted tweets 
trump.isDeleted.value_counts()

f    54050
t     1040
Name: isDeleted, dtype: int64

In [33]:
# keeping only the non-deleted data
trump = trump[trump.isDeleted == 'f']

In [34]:
# retweets 
trump.isRetweet.value_counts()

f    45121
t     8929
Name: isRetweet, dtype: int64

In [35]:
# keeping only the non-retweeted data
trump = trump[trump.isRetweet == 'f']

In [36]:
print('Trump after first basic cleaning (deleted & retweets) dataset size:', len(trump))

Trump after first basic cleaning (deleted & retweets) dataset size: 45121


In [37]:
# counting the number of words used in a tweet
trump['total_words'] = trump['tweet'].str.split().str.len()

# counting the number of characters used in a tweet
trump['total_charact'] = trump['tweet'].apply(lambda x: len(x))

# creating a column with all the hashtags used
trump['hashtag'] = trump['tweet'].apply(lambda x: re.findall(r'\B#\w*[a-zA-Z]+\w*', x)) 

# creating a column with the number of hashtags used
trump['num_hashtags'] = trump['hashtag'].apply(lambda x: len(x))

# creating a column with @mentions only 
trump['mentions'] = trump['tweet'].apply(lambda x: re.findall(r'@[A-Za-z0-9]+', x))

# counting the @mentions only: 
trump['num_mentions'] = trump['mentions'].apply(lambda x: len(x))

In [38]:
# extracting the emojis, counting them and making an overall count
trump['emojis'] = trump['tweet'].apply(lambda x: extract_emojis(x))
trump['counter_emojis'] = trump['tweet'].apply(lambda x: split_count(x))
trump['num_emojis'] = trump.counter_emojis.apply(lambda x : sum(x.values()))

#### CLEANING

In [39]:
# creating a column that flags retweets as "RT @user:" rather than a bare "RT",
# because a user might ask the audience to retweet a tweet (e.g. "RT this if you..")

trump['RT'] = trump['tweet'].apply(lambda x: re.findall(r'RT @(\w+):', x))

In [40]:
# checking how many retweets remain for Trump
print('total Retweets:',len(trump[(trump['RT'].str.len() > 0)]))

total Retweets: 44


In [41]:
# dropping what appear to clearly be Retweets (RT + @mentions)
trump = trump[(trump['RT'].str.len() == 0)]
print('Trump dataset size without retweets:', len(trump))

Trump dataset size without retweets: 45077


In [42]:
# cleaning tweets from hashtags
trump['cleaned'] = trump['tweet'].apply(lambda x: ' '.join(re.sub(r'\B#\w*[a-zA-Z]+\w*'," ", x).split()))

# cleaning tweets from @mentions
trump['cleaned'] = trump['cleaned'].apply(lambda x: ' '.join(re.sub(r'@[A-Za-z0-9]+'," ", x).split()))

# cleaning tweets from websites 
trump['cleaned'] = trump['cleaned'].apply(lambda x: ' '.join(re.sub(r'\b(?:https?://)?(?:(?i:[a-z]+\.)+)[^\s,]+\b'," ", x).split()))

# removing emojis 
trump['cleaned'] = trump['cleaned'].apply(lambda x: remove_emoji(x))

# removing punctuation
trump['cleaned'] = trump['cleaned'].apply(lambda x: ''.join([i for i in x if not i in punctuations]))

In [43]:
# counting the number of upper-case words in the cleaned text
# "I", "U" and "R" are excluded, as they would otherwise be counted as upper-case words
trump['num_upper_words'] = trump['cleaned'].apply(lambda x: sum([y.isupper() if y not in ['I', 'U', 'R'] else False for y in x.split()]))

In [44]:
# converting everything to lower case
trump['final'] = trump['cleaned'].str.lower()

In [45]:
# keeping only relevant columns
trump = trump[['date', 
           'likes',
           'retweets',
           'tweet',
           'tweet_id',
           'total_words',
           'total_charact', 
           'hashtag', 
           'num_hashtags', 
           'mentions', 
           'num_mentions', 
           'emojis', 
           'counter_emojis',
           'num_emojis',
           'num_upper_words',
           'cleaned',
           'final'
            ]]

## The Pope

In [46]:
# loading dataset
pope = pd.read_csv('pope_with_likes.csv')
print('Pope initial dataset size:', len(pope))

Pope initial dataset size: 3444


In [47]:
# drop duplicates with URL
pope = pope.drop_duplicates(subset = 'tweet_id')
print('Pope dataset size without URL duplicates:', len(pope))

Pope dataset size without URL duplicates: 2878


In [48]:
# counting the number of words used in a tweet
pope['total_words'] = pope['tweet'].str.split().str.len()

# counting the number of characters used in a tweet
pope['total_charact'] = pope['tweet'].apply(lambda x: len(x))

# creating a column with all the hashtags used
pope['hashtag'] = pope['tweet'].apply(lambda x: re.findall(r'\B#\w*[a-zA-Z]+\w*', x)) 

# creating a column with the number of hashtags used
pope['num_hashtags'] = pope['hashtag'].apply(lambda x: len(x))

# creating a column with @mentions only 
pope['mentions'] = pope['tweet'].apply(lambda x: re.findall(r'@[A-Za-z0-9]+', x))

# counting the @mentions only: 
pope['num_mentions'] = pope['mentions'].apply(lambda x: len(x))

In [49]:
# extracting the emojis, counting them and making an overall count
pope['emojis'] = pope['tweet'].apply(lambda x: extract_emojis(x))
pope['counter_emojis'] = pope['tweet'].apply(lambda x: split_count(x))
pope['num_emojis'] = pope.counter_emojis.apply(lambda x : sum(x.values()))

#### CLEANING

In [50]:
# creating a column that flags retweets as "RT @user:" rather than a bare "RT",
# because a user might ask the audience to retweet a tweet (e.g. "RT this if you..")

pope['RT'] = pope['tweet'].apply(lambda x: re.findall(r'RT @(\w+):', x))

In [51]:
# checking how many retweets the Pope did
print('total Retweets:',len(pope[(pope['RT'].str.len() > 0)]))

total Retweets: 0


In [52]:
# dropping what appear to clearly be Retweets (RT + @mentions)
pope = pope[(pope['RT'].str.len() == 0)]
print('Pope dataset size without retweets:', len(pope))

Pope dataset size without retweets: 2878


In [53]:
# cleaning tweets from hashtags
pope['cleaned'] = pope['tweet'].apply(lambda x: ' '.join(re.sub(r'\B#\w*[a-zA-Z]+\w*'," ", x).split()))

# cleaning tweets from @mentions
pope['cleaned'] = pope['cleaned'].apply(lambda x: ' '.join(re.sub(r'@[A-Za-z0-9]+'," ", x).split()))

# cleaning tweets from websites 
pope['cleaned'] = pope['cleaned'].apply(lambda x: ' '.join(re.sub(r'\b(?:https?://)?(?:(?i:[a-z]+\.)+)[^\s,]+\b'," ", x).split()))

# removing emojis 
pope['cleaned'] = pope['cleaned'].apply(lambda x: remove_emoji(x))

# removing punctuation
pope['cleaned'] = pope['cleaned'].apply(lambda x: ''.join([i for i in x if not i in punctuations]))

In [54]:
# counting the number of upper-case words in the cleaned text
# "I", "U" and "R" are excluded, as they would otherwise be counted as upper-case words
pope['num_upper_words'] = pope['cleaned'].apply(lambda x: sum([y.isupper() if y not in ['I', 'U', 'R'] else False for y in x.split()]))

In [55]:
# converting everything to lower case
pope['final'] = pope['cleaned'].str.lower()

In [56]:
# keeping only relevant columns
pope = pope[['date', 
           'likes',
           'retweets',
           'tweet',
           'tweet_id',
           'total_words',
           'total_charact', 
           'hashtag', 
           'num_hashtags', 
           'mentions', 
           'num_mentions', 
           'emojis', 
           'counter_emojis',
           'num_emojis',
           'num_upper_words',
           'cleaned',
           'final'
            ]]

## Elon Musk

In [57]:
# loading dataset
elon = pd.read_csv('elon_with_likes.csv')
print('Elon initial dataset size:', len(elon))

Elon initial dataset size: 11512


In [58]:
# drop duplicates with URL
elon = elon.drop_duplicates(subset = 'tweet_id')
print('Elon dataset size without URL duplicates:', len(elon))

Elon dataset size without URL duplicates: 11288


In [59]:
# counting the number of words used in a tweet
elon['total_words'] = elon['tweet'].str.split().str.len()

# counting the number of characters used in a tweet
elon['total_charact'] = elon['tweet'].apply(lambda x: len(x))

# creating a column with all the hashtags used
elon['hashtag'] = elon['tweet'].apply(lambda x: re.findall(r'\B#\w*[a-zA-Z]+\w*', x)) 

# creating a column with the number of hashtags used
elon['num_hashtags'] = elon['hashtag'].apply(lambda x: len(x))

# creating a column with @mentions only 
elon['mentions'] = elon['tweet'].apply(lambda x: re.findall(r'@[A-Za-z0-9]+', x))

# counting the @mentions only: 
elon['num_mentions'] = elon['mentions'].apply(lambda x: len(x))

In [60]:
# extracting the emojis, counting them and making an overall count
elon['emojis'] = elon['tweet'].apply(lambda x: extract_emojis(x))
elon['counter_emojis'] = elon['tweet'].apply(lambda x: split_count(x))
elon['num_emojis'] = elon.counter_emojis.apply(lambda x : sum(x.values()))

#### CLEANING

In [61]:
# creating a column that flags retweets as "RT @user:" rather than a bare "RT",
# because a user might ask the audience to retweet a tweet (e.g. "RT this if you..")

elon['RT'] = elon['tweet'].apply(lambda x: re.findall(r'RT @(\w+):', x))

In [62]:
# checking how many retweets Elon did
print('total Retweets:',len(elon[(elon['RT'].str.len() > 0)]))

total Retweets: 0


In [63]:
# dropping what appear to clearly be Retweets (RT + @mentions)
elon = elon[(elon['RT'].str.len() == 0)]
print('Elon dataset size without retweets:', len(elon))

Elon dataset size without retweets: 11288


In [64]:
# cleaning tweets from hashtags
elon['cleaned'] = elon['tweet'].apply(lambda x: ' '.join(re.sub(r'\B#\w*[a-zA-Z]+\w*'," ", x).split()))

# cleaning tweets from @mentions
elon['cleaned'] = elon['cleaned'].apply(lambda x: ' '.join(re.sub(r'@[A-Za-z0-9]+'," ", x).split()))

# cleaning tweets from websites 
elon['cleaned'] = elon['cleaned'].apply(lambda x: ' '.join(re.sub(r'\b(?:https?://)?(?:(?i:[a-z]+\.)+)[^\s,]+\b'," ", x).split()))

# removing emojis 
elon['cleaned'] = elon['cleaned'].apply(lambda x: remove_emoji(x))

# removing punctuation
elon['cleaned'] = elon['cleaned'].apply(lambda x: ''.join([i for i in x if not i in punctuations]))

In [65]:
# counting the number of upper-case words in the cleaned text
# "I", "U" and "R" are excluded, as they would otherwise be counted as upper-case words
elon['num_upper_words'] = elon['cleaned'].apply(lambda x: sum([y.isupper() if y not in ['I', 'U', 'R'] else False for y in x.split()]))

In [66]:
# converting everything to lower case
elon['final'] = elon['cleaned'].str.lower()

In [67]:
# keeping only relevant columns
elon = elon[['date', 
           'likes',
           'retweets',
           'tweet',
           'tweet_id',
           'total_words',
           'total_charact', 
           'hashtag', 
           'num_hashtags', 
           'mentions', 
           'num_mentions', 
           'emojis', 
           'counter_emojis',
           'num_emojis',
           'num_upper_words',
           'cleaned',
           'final'
            ]]

# MERGING DATASETS

In [68]:
# adding a column for the classification 
trump['persona'] = 'trump'
bid['persona'] = 'biden'
kim['persona'] = 'kim'
pope['persona'] = 'pope'
elon['persona'] = 'elon'

In [69]:
# shape of the cleaned datasets
print("Trump dataset size:", trump.shape)
print("Biden dataset size:", bid.shape)
print("Kim dataset size:", kim.shape)
print("Pope dataset size:", pope.shape)
print("Elon dataset size:", elon.shape)

Trump dataset size: (45077, 18)
Biden dataset size: (6038, 18)
Kim dataset size: (27281, 18)
Pope dataset size: (2878, 18)
Elon dataset size: (11288, 18)


In [70]:
# concatenating all datasets into a new df
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
total = pd.concat([trump, bid, kim, pope, elon])
print("Total Merged datasets size:", total.shape)

Total Merged datasets size: (92562, 18)


In [71]:
total.drop_duplicates(subset='tweet_id', inplace = True)
print("Total Merged datasets size after dropping URL duplicates:", total.shape)

Total Merged datasets size after dropping URL duplicates: (92559, 18)
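
As a quick arithmetic check on the sizes printed above, the sketch below sums the per-persona row counts and compares them with the merged and deduplicated totals. The numbers are copied from the outputs in this notebook; the variable names are only illustrative.

```python
# row counts of the cleaned per-persona datasets, as printed above
sizes = {'trump': 45077, 'biden': 6038, 'kim': 27281, 'pope': 2878, 'elon': 11288}

merged_rows = sum(sizes.values())
rows_after_dedup = 92559  # size reported after drop_duplicates on tweet_id

print(merged_rows)                     # 92562
print(merged_rows - rows_after_dedup)  # 3
```

Since each persona dataset was already deduplicated on `tweet_id`, the three rows dropped here would be `tweet_id` values shared between different persona datasets.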


In [72]:
total.tweet_id.is_unique

True

In [73]:
# to_csv returns None when given a path, so the result is not assigned back to total
total.to_csv('final_dataset.csv', index=False)