# Introduction to NLP: Assignment 1 

### Description of Assignment 1

This assignment relates to the theme Basic NLP of the Introduction to NLP course (courses Deskriptiv analytik / Machine learning for descriptive problems), and focuses on basic text processing methods in Python.

The assignment is handed in as a Jupyter notebook containing the code used to solve the problem, output presenting the results, and, most importantly, notes that present the students' conclusions and answer questions posed in the assignment. 

**Assignment steps/Questions:**

Download and extract the data package from here: [http://dl.turkunlp.org/intro-to-nlp.tar.gz](http://dl.turkunlp.org/intro-to-nlp.tar.gz). The gzipped file intro-to-nlp/english-tweets-sample.jsonl.gz includes 10,000 English tweets downloaded from the Twitter API. The file is compressed and in JSON Lines format ([http://jsonlines.org/](http://jsonlines.org/)), i.e. one JSON object per line.

Note: If processing the whole file takes too long, it's okay to read just a subset of the data, for example only 2,000 tweets.

1. Read tweets in Python 
2. Extract the actual text fields from the tweet jsons, discard all metadata at this point. Note that sometimes the text may be truncated to fit the old character limit. In these cases, is it possible to get the full text?
3. Segment each tweet using both the UDPipe machine-learned model (found in the same data package) and a heuristic method. What can you tell about the segmentation performance on tweets when manually inspecting a few examples?
4. Count a word frequency list (how many times each word appears and how many unique words there are). Which are the most common words appearing in the data? What kind of words are these?
5. Calculate the **idf** weight for each word appearing in the data (one tweet = one document), and print the top 20 words with the lowest and highest idf values. Why does **tf** not really matter when processing tweets?
6. Find duplicate or near-duplicate tweets (in terms of the text field only) in the data using any method you see fit. What kinds of techniques did you consider using and/or test, and how many duplicates or near duplicates did you find?

### Import libraries

In [1]:
import json
import gzip
import ufal.udpipe as udpipe
import re
import pandas as pd
import numpy as np
from collections import Counter
import nltk
nltk.download('stopwords') # download the stopwords dataset
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Niklas\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## 1. Read tweets in Python 

Let's start by reading in the tweets from the gzipped jsonline file. 

In [2]:
# Read data from gzip file and decode the json to a list 

#import json
#import gzip

data = []
with gzip.open('english-tweets-sample.jsonl.gz', 'rb') as f:
    for line in f:        
        data.append(json.loads(line))
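
If reading all 10,000 tweets is too slow, the same loop can stop early. The following is a minimal sketch of that idea; the helper name `read_jsonl_gz` and its `limit` parameter are made up for illustration and are not part of the assignment material.

```python
import gzip
import json
from itertools import islice


def read_jsonl_gz(path, limit=None):
    """Read up to `limit` JSON objects from a gzipped JSON Lines file.

    `limit=None` reads the whole file. Blank lines are skipped, but they
    still count towards the limit because islice counts raw lines.
    """
    with gzip.open(path, "rt", encoding="utf-8") as f:
        # islice stops pulling lines from the file once `limit` is reached
        return [json.loads(line) for line in islice(f, limit) if line.strip()]
```

Calling `read_jsonl_gz('english-tweets-sample.jsonl.gz', limit=2000)` would then give a 2,000-tweet subset without decompressing the rest of the file.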

In [3]:
# Show information of the data
print("Number of documents:", len(data))
print("Data type:", type(data))
print("First item type:", type(data[0]))
print("First item:", data[0])

Number of documents: 10000
Data type: <class 'list'>
First item type: <class 'dict'>
First item: {'created_at': 'Tue Dec 26 14:16:22 +0000 2017', 'id': 945659557480611840, 'id_str': '945659557480611840', 'text': 'Check out my class in #GranblueFantasy! https://t.co/pAvXn8diJr', 'display_text_range': [0, 39], 'source': '<a href="http://granbluefantasy.jp/" rel="nofollow">グランブルー ファンタジー</a>', 'truncated': False, 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 883980236655779840, 'id_str': '883980236655779840', 'name': 'Pc Kwok', 'screen_name': 'jensenpck', 'location': None, 'url': None, 'description': None, 'translator_type': 'none', 'protected': False, 'verified': False, 'followers_count': 0, 'friends_count': 1, 'listed_count': 0, 'favourites_count': 0, 'statuses_count': 42, 'created_at': 'Sun Jul 09 09:24:46 +0000 2017', 'utc_offset': None, 'time_zone': None, '

## 2. Extract the actual text fields from the tweet jsons

**Extract the actual text fields from the tweet jsons, discard all metadata at this point. Note that sometimes the text may be truncated to fit the old character limit. In these cases, is it possible to get the full text?**

Let's start by extracting only the text fields from the tweets. Some of the tweets are truncated, and yes, it is possible to recover the full text of a truncated tweet.

There are two kinds of truncated texts:
1. Original tweets that are truncated (the value of truncated is True).
2. Retweeted tweets that are truncated.

For a retweet, the top-level truncated field does not tell whether the text has been cut off. The original tweet is stored under retweeted_status, where it is possible to check whether it was truncated and, if so, to read the complete text from the full_text field of extended_tweet.

In [4]:
# Extract the actual text field and discard the metadata
documents = []

for d in data:
    # Check if retweeted
    if("retweeted_status" in d):
        # Check if retweet is truncated
        if(d["retweeted_status"]["truncated"] == True):
            documents.append(d["retweeted_status"]["extended_tweet"]["full_text"])    
        else:
            documents.append(d["retweeted_status"]["text"])
    # Check if original tweet is truncated                
    elif(d["truncated"] == True): 
        documents.append(d["extended_tweet"]["full_text"])
    # Default case
    else:
        documents.append(d["text"])
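
The branching above can also be written as one small function that tolerates missing keys. This is only a sketch under the assumption that every tweet is a dict using the field names seen in the cell above (`retweeted_status`, `truncated`, `extended_tweet`, `full_text`, `text`); the function name `extract_full_text` is hypothetical.

```python
def extract_full_text(tweet):
    """Return the longest available text for a tweet dict."""
    # A retweet carries the original tweet under "retweeted_status";
    # for a normal tweet we simply look at the tweet itself.
    source = tweet.get("retweeted_status", tweet)
    extended = source.get("extended_tweet")
    # Use the extended text only when the tweet is flagged as truncated
    # and the extended payload is really there.
    if source.get("truncated") and extended and "full_text" in extended:
        return extended["full_text"]
    # Fall back to the (possibly shortened) text field.
    return source.get("text", "")
```

With this helper the loop reduces to `documents = [extract_full_text(d) for d in data]`, and a tweet that is marked as truncated but lacks `extended_tweet` falls back to its short text instead of raising a `KeyError`.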

In [5]:
# Show information of the data
print("Number of documents:", len(documents))
print("Documents type:", type(documents))
print("First item type:", type(documents[0]))
print("First item:", documents[0])
print("Second item:", documents[1])
print("Third item:", documents[2])

Number of documents: 10000
Documents type: <class 'list'>
First item type: <class 'str'>
First item: Check out my class in #GranblueFantasy! https://t.co/pAvXn8diJr
Second item: Extending a big Thank You to our Community Partner all over the world! https://t.co/cu7on7g1si
Third item: Blueberry 🍨 https://t.co/2gzHAFWYJY


## 3. Segment each tweet using both UDPipe machine learned model and a heuristic method

**Segment each tweet using both the UDPipe machine-learned model (found in the same data package) and a heuristic method. What can you tell about the segmentation performance on tweets when manually inspecting a few examples?**

In [6]:
# Segmentation with UDPipe machine learned model

#import ufal.udpipe as udpipe

model = udpipe.Model.load("en.segmenter.udpipe")
pipeline = udpipe.Pipeline(model,"tokenize","none","none","horizontal")

segmented_documents = []

for d in documents:
    segmented_documents.append(pipeline.process(d))


In [7]:
# Show data
df = pd.DataFrame({'Tweets':segmented_documents})
df.head()

| | Tweets |
|---|---|
| 0 | Check out my class in # GranblueFantasy !\nhtt... |
| 1 | Extending a big Thank\nYou to our Community Pa... |
| 2 | Blueberry 🍨 https://t.co/2gzHAFWYJY\n |
| 3 | Bad day ☹️®️\n |
| 4 | @prologve_ @BTS_ARMY @BTS_twt I 'm Chim tho\n |


In [8]:
# Segmentation with a heuristic model

#import re

heuristic_seg_docs = []

for d in documents:
    segmented = re.sub(r'https?:\/\/.*\/\w*', '', d) #remove links
    segmented = re.sub(r'@[^\s]+', '', segmented) #remove twitter handles
    segmented = re.sub(r'#[^\s]+', '', segmented) #remove hashtags
    segmented = re.sub(r'(&.+;)', '', segmented) #remove &amp;  &gt;  ...
    segmented = re.sub(r'[\U00010000-\U0010ffff]', '', segmented) #remove some emojis
    segmented = segmented.lower() # Make every char lower for easier tokenization
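    segmented = re.sub(r'rt', '', segmented) # remove retweet markers; note: without word boundaries this also strips "rt" inside words (e.g. "birthday" -> "bihday")
    segmented = re.sub(r'([. , ! ? : ; # @ & - € $]+)', r' \1 ', segmented) # pad runs of punctuation (and spaces) with spaces so they become separate tokens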
    segmented = re.sub(r"(’t)", r" \1", segmented) # clitics n’t
    segmented = re.sub(r"('t)", r" \1", segmented) # clitics n't
    segmented = re.sub(r"(’s)", r" \1", segmented) # clitics ’s
    segmented = re.sub(r"('s)", r" \1", segmented) # clitics 's
    segmented = re.sub(r"(’re)", r" \1", segmented) # clitics ’re
    segmented = re.sub(r"('re)", r" \1", segmented) # clitics 're
    segmented = re.sub(r"(’m)", r" \1", segmented) # clitics ’m
    segmented = re.sub(r"('m)", r" \1", segmented) # clitics 'm
    segmented = re.sub(r"(’ve)", r" \1", segmented) # clitics ’ve
    segmented = re.sub(r"('ve)", r" \1", segmented) # clitics 've
    segmented = re.sub(r"(’d)", r" \1", segmented) # clitics ’d
    segmented = re.sub(r"('d)", r" \1", segmented) # clitics 'd
    segmented = re.sub(r"(’ll)", r" \1", segmented) # clitics ’ll
    segmented = re.sub(r"('ll)", r" \1", segmented) # clitics 'll
    segmented = re.sub(r'\s+', ' ', segmented) # Collapse runs of whitespace into a single space
    
    heuristic_seg_docs.append(segmented)
    

In [9]:
# Show data
df = pd.DataFrame({'Tweets':heuristic_seg_docs})
df.head()

| | Tweets |
|---|---|
| 0 | check out my class in |
| 1 | extending a big thank you to our community pan... |
| 2 | blueberry |
| 3 | bad day ☹️®️ |
| 4 | i 'm chim tho |


The results of the UDPipe and the heuristic segmentation are quite different. The heuristic method works fairly well, at least for English, but it has to encode language-specific rules by hand. Links and Twitter handles are easy to remove with a few regexes. When the heuristic method meets slang or badly written English it can produce odd segmentations, e.g. i'mma -> " i 'mma ". A heuristic model can target very specific segmentation needs, but it requires a lot of work to do properly. I preferred the heuristic method because it makes it easy to remove hashtags, Twitter handles, links and other noise from the text.
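
One concrete example of a heuristic rule misfiring is the retweet rule: the pattern `r'rt'` matches inside longer words as well, which is why "birthday" appears as "bihday" in later output. A possible fix, shown here only as a sketch and not applied to the results in this notebook, is to anchor the pattern with word boundaries. The function name `strip_retweet_marker` is made up for illustration.

```python
import re


def strip_retweet_marker(text):
    """Remove a standalone 'rt' token without touching words that contain 'rt'."""
    # \b matches only at word boundaries, so "birthday" and "partner" survive
    return re.sub(r'\brt\b', '', text, flags=re.IGNORECASE)


sample = "rt happy birthday partner"
naive = re.sub(r'rt', '', sample)       # " happy bihday paner"
bounded = strip_retweet_marker(sample)  # " happy birthday partner"
```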

Comparing the first tweets shows that the links and hashtags have disappeared in the heuristic output. Otherwise the two methods give fairly similar results.

The UDPipe machine-learned model works very well. It handles clitics and hyphenated compound words.
UDPipe models can be trained for several languages. If a trained model is available, it seems to be a fast and accurate way of segmenting text.

Both methods seem to have done a good job, but to truly measure the performance of the two segmentation methods you would count how many tokens are segmented correctly against a manually segmented reference. I will use the heuristic method in the following calculations.
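
Such a measurement could be sketched as token-level precision, recall and F1 against a hand-segmented reference. The code below is an illustration only: it assumes both tokenizations cover exactly the same characters once whitespace is ignored (so it would not apply directly to the heuristic output, which deletes links and hashtags), and the function names are hypothetical.

```python
def token_spans(tokens):
    """Map a token list to character spans over the text without whitespace."""
    spans = set()
    position = 0
    for token in tokens:
        spans.add((position, position + len(token)))
        position += len(token)
    return spans


def segmentation_scores(predicted, gold):
    """Return (precision, recall, f1) of predicted tokens against gold tokens."""
    predicted_spans = token_spans(predicted)
    gold_spans = token_spans(gold)
    # A predicted token is correct when it starts and ends where a gold token does
    correct = len(predicted_spans & gold_spans)
    precision = correct / len(predicted_spans) if predicted_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

For example, scoring the prediction `["i'm", "chim", "tho"]` against the reference `["i", "'m", "chim", "tho"]` counts two of the three predicted tokens as correct, because the unsplit clitic matches no reference token.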

## 4. Count a word frequency list

**Count a word frequency list (how many times each word appears and how many unique words there are). Which are the most common words appearing in the data? What kind of words are these?**

In [10]:
# Create a word frequency counter

#from collections import Counter

token_counter = Counter()

for doc in heuristic_seg_docs: 
    tokenized = pipeline.process(doc)
    tokens = tokenized.split() # after segmenter, we can do whitespace splitting
    token_counter.update(tokens)

In [11]:
# Show data
print("Vocabulary size:", len(token_counter))
df = pd.DataFrame.from_dict(token_counter, orient='index').reset_index()
df = df.rename(columns={'index':'Word', 0:'Count'})
df.sort_values('Count', ascending=False)[0:20]

Vocabulary size: 17187


| | Word | Count |
|---|---|---|
| 69 | . | 5929 |
| 16 | the | 4345 |
| 92 | , | 3714 |
| 10 | to | 3445 |
| 6 | a | 2940 |
| 23 | i | 2871 |
| 78 | and | 2705 |
| 9 | you | 2663 |
| 63 | of | 2069 |
| 4 | in | 1945 |


The vocabulary size of the segmented documents is 17187.
Before cleaning the tokens, the most common ones are punctuation characters and stop words.

Let's clean the data a bit and then take a closer look by removing stop words and punctuation characters.

In [12]:
# Remove stop words and punctuation characters

#import nltk
#nltk.download('stopwords') # download the stopwords dataset
#from nltk.corpus import stopwords

filtered_tokens = {}
punctuation_chars = '. .. , : ( ) ! !! ? ?? " = & - ; ... \\ " ” [ ] # @ “ / * % € $ '.split() # list of punctuation symbols to ignore
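english_stopwords = set(stopwords.words("english")) # build the set once instead of reloading the list for every token
for word, count in token_counter.most_common():
    if word.lower() in english_stopwords or word in punctuation_chars: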
        continue
    filtered_tokens[word] = count

In [13]:
# Show data
print("Vocabulary size:", len(filtered_tokens))
df = pd.DataFrame.from_dict(filtered_tokens, orient='index').reset_index()
df = df.rename(columns={'index':'Word', 0:'Count'})
df.sort_values('Count', ascending=False)[0:20]

Vocabulary size: 17014


| | Word | Count |
|---|---|---|
| 0 | 's | 929 |
| 1 | ’s | 642 |
| 2 | christmas | 579 |
| 3 | 't | 569 |
| 4 | like | 532 |
| 5 | ’t | 474 |
| 6 | one | 456 |
| 7 | ’ | 447 |
| 8 | love | 438 |
| 9 | people | 396 |


After removing stop words and some punctuation, the vocabulary size dropped to 17014, so only 173 tokens were removed.

The vocabulary still contains some emoticons, emojis, dingbats and other symbols. We will ignore those for now.

There are 17014 unique "words". As mentioned above, there are still some unwanted elements in the texts.

### The most common words

Some of the most common words are:
* 's clitic 
* Christmas
* 't clitic
* like
* one
* love
* people
* new
* year
* get
* day 
* 2017
* good
* time
* today

If we only look at the words, not numbers or clitics then we have:
* Christmas
* like
* one
* love
* people
* new
* year
* get
* day 
* good
* time
* today

### What kind of words are these?

**The words are mostly nouns**, with a few verbs and adjectives in the mix.


In [14]:
# The 30 most common words 
df.sort_values('Count', ascending=False)[0:30]

| | Word | Count |
|---|---|---|
| 0 | 's | 929 |
| 1 | ’s | 642 |
| 2 | christmas | 579 |
| 3 | 't | 569 |
| 4 | like | 532 |
| 5 | ’t | 474 |
| 6 | one | 456 |
| 7 | ’ | 447 |
| 8 | love | 438 |
| 9 | people | 396 |


## 5. Calculate **idf** weight for each word appearing in the data 
**Calculate the idf weight for each word appearing in the data (one tweet = one document), and print the top 20 words with the lowest and highest idf values. Why does tf not really matter when processing tweets?**

Let's start by creating the TF and IDF functions and then calculate the TF and IDF values from the documents. The TF counts are computed for every token, while the IDF dictionary only keeps words that are found in filtered_tokens.

I use filtered_tokens for the IDF values because I only want to show the IDF of the "meaningful" words and not the stop words. Including all the words should not change the picture much, because IDF already pushes frequent words down: the more documents a word appears in, the lower its value.
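
To make the weighting concrete, here is a small sketch on a toy corpus, using idf(word) = log(N / df(word)), where N is the number of documents and df(word) is the number of documents containing the word. The toy documents and the function names are invented for illustration; the natural logarithm is used, as in `math.log`.

```python
import math
from collections import Counter


def document_frequencies(tokenized_docs):
    """Count in how many documents each word occurs (not how often)."""
    df = Counter()
    for tokens in tokenized_docs:
        df.update(set(tokens))  # set(): a word counts once per document
    return df


def idf_weights(tokenized_docs):
    """Return idf = log(N / df) for every word in the corpus."""
    n = len(tokenized_docs)
    return {word: math.log(n / count)
            for word, count in document_frequencies(tokenized_docs).items()}


toy_docs = [["merry", "christmas"], ["christmas", "tree"],
            ["bad", "day"], ["christmas", "day"]]
toy_idf = idf_weights(toy_docs)
# "christmas" occurs in 3 of 4 documents and gets the lowest weight,
# "tree" occurs in only 1 of 4 and gets the highest.
```

With N = 10,000 tweets, a word that appears in a single tweet gets log(10000 / 1) ≈ 9.21034, which is the largest value a word can reach in this data set.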



In [15]:
# Calculate TF for all
# Puts the words in to a dictionary containing dictionaries of all the words in that document   
def calculateTF(documents):
    wordDictionary = {}

    tfDocuments = {}
    key = 0
    
    for doc in documents:
    
        docWordDictionary = {}
        # Split words, then add them to a dictionary
        bagOfWords = doc.split(" ")
        for word in bagOfWords:
            if word in docWordDictionary:
                docWordDictionary[word] += 1
            else:
                docWordDictionary[word] = 1

        # Store the counts for this document and increment the dictionary key
        tfDocuments[key] = docWordDictionary
        key += 1
    
    return tfDocuments
    

In [16]:
# Calculates Idf for all words
# Takes in a TF dictionary and returns a dictionary with words and their idf values
def calculateIDF(documents):
    import math
    n = len(documents)
    
    idf = {}
    
    # loop through all words (every word is unique, count of word is not taken into account). 
    # If word is in filtered_tokens then add it. 
    for key, value in documents.items():
        for word in value:
            if(word in filtered_tokens):
                if word in idf:
                    idf[word] += 1
                else:
                    idf[word] = 1

    # Convert count to idf with log(n/count)
    for word, count in idf.items():
        idf[word] = math.log(n/float(count))

    
    return idf

In [17]:
# Calculates TF and IDF values
tfDocuments = calculateTF(heuristic_seg_docs)
idfValues = calculateIDF(tfDocuments)    

In [18]:
# Show data, Highest idf values
df = pd.DataFrame.from_dict(idfValues, orient='index').reset_index()
df = df.rename(columns={'index':'Word', 0:'idf'})
df.sort_values('idf', ascending=False)[0:20]

| | Word | idf |
|---|---|---|
| 16162 | fridge | 9.21034 |
| 6449 | booing | 9.21034 |
| 11546 | cinderella | 9.21034 |
| 11545 | spotlight | 9.21034 |
| 11544 | coolest | 9.21034 |
| 11542 | grimey | 9.21034 |
| 11540 | shirahoshi | 9.21034 |
| 6436 | consumer | 9.21034 |
| 11539 | minimalist | 9.21034 |
| 11538 | yeahhh | 9.21034 |


In [19]:
# Show data, Lowest idf values
df = pd.DataFrame.from_dict(idfValues, orient='index').reset_index()
df = df.rename(columns={'index':'Word', 0:'idf'})
df.sort_values('idf', ascending=True)[0:20]

| | Word | idf |
|---|---|---|
| 28 | 's | 2.544657 |
| 56 | ’s | 2.915074 |
| 30 | christmas | 2.945039 |
| 150 | like | 3.036554 |
| 160 | 't | 3.042824 |
| 60 | ’t | 3.184474 |
| 115 | one | 3.186893 |
| 420 | love | 3.267541 |
| 63 | people | 3.352407 |
| 296 | new | 3.381395 |


Term frequency does not really matter here because of the length of tweets and the language used on Twitter. A tweet is so short that most words occur only once in it, so TF says little about which words characterise the tweet. If we calculated TF on the unfiltered documents, stop words and other noise would bloat the calculations, and a user who repeats a word several times in one tweet would skew the value. On its own, term frequency is not a great way of processing tweets, but combined with IDF it gives the TF-IDF values.

IDF takes into account how often a word is present across all the documents. The more documents it appears in, the lower its value. This does a better job of filtering out stop words and other frequently used words.

Depending on the task, it can be wrong to assume that stop words are pure noise in the data, but in this exercise I decided to remove them from the IDF values.
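
One way to check the point about short documents is to measure what share of the distinct tokens in each tweet occur exactly once. The sketch below assumes whitespace-separated documents such as heuristic_seg_docs; the function name is made up for illustration and no result for the real data is claimed here.

```python
from collections import Counter


def share_of_tokens_occurring_once(docs):
    """Share of (document, word) pairs where the word occurs once in the document."""
    once = 0
    total = 0
    for doc in docs:
        counts = Counter(doc.split())
        total += len(counts)                                  # distinct words in this document
        once += sum(1 for c in counts.values() if c == 1)    # words seen exactly once
    return once / total if total else 0.0
```

A value close to 1 would mean that TF is nearly constant within a tweet, so TF-IDF rankings are driven almost entirely by the IDF part.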

## 6. BONUS: Find duplicate or near duplicate tweets
**Find duplicate or near-duplicate tweets (in terms of the text field only) in the data using any method you see fit. What kinds of techniques did you consider using and/or test, and how many duplicates or near duplicates did you find?**

I first thought about a very slow brute-force method of comparing the strings. You could remove the spaces and compare only the contiguous string that is left. This method would find the exact matches of a string.
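
The exact-match idea does not have to be slow: instead of comparing every pair of strings, the whitespace-free string can be used as a dictionary key so that identical tweets fall into the same group in a single pass. This is a sketch of that variant, not the method used for the results below, and the function name is hypothetical.

```python
from collections import defaultdict


def group_exact_duplicates(docs):
    """Group document indices whose text is identical once whitespace is removed."""
    groups = defaultdict(list)
    for index, text in enumerate(docs):
        key = "".join(text.split())  # drop all whitespace, keep everything else
        groups[key].append(index)
    # Only groups with more than one member are duplicates
    return [indices for indices in groups.values() if len(indices) > 1]
```

Note that removing all whitespace also makes "a b" and "ab" look identical, so collapsing runs of whitespace to a single space may be the safer key in practice.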

Next I thought about creating a matrix of the TF values and then comparing one TF column with the next. This is also a way to find near-duplicate tweets based on the words in the text. This method does not care about word order.

Lastly I remembered something called cosine similarity from math lectures at Aalto. After a bit of googling I found a way to use TF-IDF to calculate the cosine similarity. So let's turn the TF-IDF values into vectors and calculate the similarity. This method should be able to find both duplicates and near duplicates because it gives a similarity measure.

Let's start by trying to implement the TF-IDF and then calculate the cosine similarity.

I'm not entirely sure whether I should calculate the TF-IDF with all the stop words, but in this case I have only used the filtered words.
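
As a point of reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it for sparse vectors stored as dictionaries that hold only the non-zero weights; this representation and the function name `sparse_cosine` are assumptions for illustration and differ from the dense dictionaries built in the cells that follow.

```python
import math


def sparse_cosine(a, b):
    """Cosine similarity of two sparse vectors given as {term: weight} dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    # Only terms present in both vectors contribute to the dot product
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)
```

Because only the words that actually occur in a tweet are stored, one comparison touches a handful of entries instead of the whole vocabulary.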

In [20]:
def calculateTFIDF(tf, idf):
    
    tfidf = {}
    key = 0
    
    # create TF-IDF dictionaries of all the words, Will be used as vectors later with cosine similarity
    for key, value in tf.items():
        docTfIdf = dict(filtered_tokens)
        docTfIdf = dict.fromkeys(docTfIdf, 0)
        
        # Convert TF to TF score with Frequency/total words
        totalWordsInDoc = sum(value.values())
        for word in value:
            if(word in filtered_tokens):
                docTfIdf[word] = value[word]/totalWordsInDoc * idf[word]
                
                
       
        tfidf[key] = docTfIdf
        key += 1
        
    return tfidf

In [21]:
# Calculate TFIDF
tfIdfValues = calculateTFIDF(tfDocuments,idfValues)    

In [22]:
def calculateCosSim(tfidf):
    
    # For every document, collect the later documents whose cosine similarity exceeds 0.9
    cosSim = {}
    
    for key, value in tfidf.items():
        similarity = {}
        for compareKey, compareValue in tfidf.items():
            # Compare each pair only once and skip the self-comparison
            if(compareKey <= key):
                continue
            sim = CosineSimilarity(value, compareValue)
            # Above 0.9 the tweet is a near duplicate; a value of 1 means a duplicate tweet
            if(sim > 0.9):
                similarity[compareKey] = sim
        
        # Store the matches for this document (this assignment was commented out before, so the result stayed empty)
        cosSim[key] = similarity
        
    return cosSim
            
            
def CosineSimilarity(d1, d2):
    # Both dictionaries share the same keys in the same order (see calculateTFIDF)
    d1List = np.array(list(d1.values()))
    d2List = np.array(list(d2.values()))
                  
    dotProd = np.dot(d1List, d2List)
    denominator = np.linalg.norm(d1List) * np.linalg.norm(d2List)
    
    # A document without any filtered tokens is a zero vector; treat it as dissimilar
    if denominator == 0:
        return 0.0
    
    return dotProd/denominator
    

In [23]:
# Calculate cosine similarity dictionary
CosineSimDict = calculateCosSim(tfIdfValues)




KeyboardInterrupt: 

### Running this function on a single thread would take ages

I began to read articles about speeding up machine learning and data science methods in Python. I came across a library called numba that would enable me to use CUDA cores for processing. After toying around with CUDA I tried to apply it to this project, but I quickly found out that the topic is more complex and needs more hours of reading the documentation. I got stuck with np.sum(), dictionaries and other kinds of problems. I still wanted to present some data, so I ran this code in Notebook CSC for the first 100 entries and compared them to all documents. I found that among the first 100 there were:

* 130 duplicates 
* 4 near duplicates

In the future I am intrigued to learn more about splitting the processing workload to the GPU.
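
A possible alternative that stays on the CPU is to avoid most of the comparisons altogether: two tweets can only have a non-zero cosine similarity if they share at least one word, so an inverted index from words to documents restricts the work to such pairs. The following is a sketch of that idea under the assumption that each document is a sparse {word: weight} dictionary with only non-zero entries; it was not used for the numbers reported above, and the function names are hypothetical.

```python
import math
from collections import defaultdict


def normalize(vector):
    """Scale a sparse vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(w * w for w in vector.values()))
    return {term: w / norm for term, w in vector.items()} if norm else {}


def near_duplicate_pairs(vectors, threshold=0.9):
    """Return {(i, j): similarity} for pairs with i < j and similarity > threshold."""
    index = defaultdict(list)  # term -> [(doc_id, weight)] for documents seen so far
    pairs = {}
    for doc_id, vector in enumerate(normalize(v) for v in vectors):
        scores = defaultdict(float)
        # Accumulate dot products only with earlier documents sharing a term
        for term, weight in vector.items():
            for other_id, other_weight in index[term]:
                scores[other_id] += weight * other_weight
        for other_id, score in scores.items():
            if score > threshold:
                pairs[(other_id, doc_id)] = score
        # Make this document available to the ones that follow
        for term, weight in vector.items():
            index[term].append((doc_id, weight))
    return pairs
```

Very common words still produce long index lists, so how much this helps depends on how aggressively frequent words have been filtered out beforehand.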

When inspecting the examples of the duplicates I noticed that this method should be run with all the words included if you want a more accurate picture of duplicate tweets. Some of the tweets have small variations that are dropped by the heuristic segmentation or by the stop word removal. The content of the texts is almost the same when running this method.

Below are the duplicate tweets

In [24]:
# Takes in array and prints all the documents based on keys from array
def printDuplicateTweets(numbers):
    for x in numbers:
        print(heuristic_seg_docs[x])

In [25]:
printDuplicateTweets([0,4352,9765])

check out my class in 
check out my class in 
check out my class in 


In [26]:
printDuplicateTweets([3,4241])

bad day ☹️®️
bad day ☹️®️


In [27]:
printDuplicateTweets([10,256])

use cases/application areas of in offline/ 
use cases/application areas of in offline/ 


In [28]:
printDuplicateTweets([12,378,4474,4617,4620,8505])

our dad passed away earlier this summer so my mom and i decided to surprise my sisters with bears with his favorite cologne and a recording of his voice . it ’s not christmas without you dad , but we have you in spirit ❤️ 
our dad passed away earlier this summer so my mom and i decided to surprise my sisters with bears with his favorite cologne and a recording of his voice . it ’s not christmas without you dad , but we have you in spirit ❤️ 
our dad passed away earlier this summer so my mom and i decided to surprise my sisters with bears with his favorite cologne and a recording of his voice . it ’s not christmas without you dad , but we have you in spirit ❤️ 
our dad passed away earlier this summer so my mom and i decided to surprise my sisters with bears with his favorite cologne and a recording of his voice . it ’s not christmas without you dad , but we have you in spirit ❤️ 
our dad passed away earlier this summer so my mom and i decided to surprise my sisters with bears with his f

In [29]:
printDuplicateTweets([15,1128])

president trump cuts funding to un after israel vote - newsweek 
president trump cuts funding to un after israel vote - newsweek 


In [30]:
printDuplicateTweets([26,1094,1317,1391,2072,2138,3563,5382,5779,6225,6652,8530,8794,9809,9957])

y’all could ’ve just said that a transgender couple have a baby rather than giving me brain damage 
y’all could ’ve just said that a transgender couple have a baby rather than giving me brain damage 
y’all could ’ve just said that a transgender couple have a baby rather than giving me brain damage 
y’all could ’ve just said that a transgender couple have a baby rather than giving me brain damage 
y’all could ’ve just said that a transgender couple have a baby rather than giving me brain damage 
y’all could ’ve just said that a transgender couple have a baby rather than giving me brain damage 
y’all could ’ve just said that a transgender couple have a baby rather than giving me brain damage 
y’all could ’ve just said that a transgender couple have a baby rather than giving me brain damage 
y’all could ’ve just said that a transgender couple have a baby rather than giving me brain damage 
y’all could ’ve just said that a transgender couple have a baby rather than giving me brain damage 


In [31]:
printDuplicateTweets([30,1563,4142,4161,8557])

tron is a long term hodl . those who are flipping it are losing out . 
tron is a long term hodl . those who are flipping it are losing out . 
tron is a long term hodl . those who are flipping it are losing out . 
tron is a long term hodl . those who are flipping it are losing out . 
tron is a long term hodl . those who are flipping it are losing out . 


In [32]:
printDuplicateTweets([35,1255,6410,6981])

 itranslategames lady gaga ☂ zara larsson
 itranslategames lady gaga ☂ zara larsson
 itranslategames lady gaga ☂ zara larsson
 itranslategames lady gaga ☂ zara larsson


In [33]:
printDuplicateTweets([40,177,1098,4234,4861,7270,9306,9977])

my lil filipino mom thought her iphone x was perfume and i cry everytime i watch it . 
my lil filipino mom thought her iphone x was perfume and i cry everytime i watch it . 
my lil filipino mom thought her iphone x was perfume and i cry everytime i watch it . 
my lil filipino mom thought her iphone x was perfume and i cry everytime i watch it . 
my lil filipino mom thought her iphone x was perfume and i cry everytime i watch it . 
my lil filipino mom thought her iphone x was perfume and i cry everytime i watch it . 
my lil filipino mom thought her iphone x was perfume and i cry everytime i watch it . 
my lil filipino mom thought her iphone x was perfume and i cry everytime i watch it . 


In [34]:
printDuplicateTweets([42,409,2903,4475,5515,6083,6885,8670,9098,9825])

george lopez my wife and kids everybkdy hates chris the nanny fresh prince of belair rick moy degrassi law order the big bang theory sabrina the teenage witch full house boy meets world scooby-doo , where are you ? all that drake josh thats so raven even stevens zoey 101 
george lopez my wife and kids everybkdy hates chris the nanny fresh prince of belair rick moy degrassi law order the big bang theory sabrina the teenage witch full house boy meets world scooby-doo , where are you ? all that drake josh thats so raven even stevens zoey 101 
george lopez my wife and kids everybkdy hates chris the nanny fresh prince of belair rick moy degrassi law order the big bang theory sabrina the teenage witch full house boy meets world scooby-doo , where are you ? all that drake josh thats so raven even stevens zoey 101 
george lopez my wife and kids everybkdy hates chris the nanny fresh prince of belair rick moy degrassi law order the big bang theory sabrina the teenage witch full house boy meets w

In [35]:
printDuplicateTweets([43,1320,3233,3263,3485,3647,4509,6941,8705,9154,9187,9594,9880,9994])

it ’s a new day and a new world , or so it seems . yesterday ’s pr ... more for aries 
it ’s a new day and a new world , or so it seems . yesterday ’s pr ... more for aries 
it ’s a new day and a new world , or so it seems . yesterday ’s pr ... more for aries 
it ’s a new day and a new world , or so it seems . yesterday ’s pr ... more for aries 
it ’s a new day and a new world , or so it seems . yesterday ’s pr ... more for aries 
it ’s a new day and a new world , or so it seems . yesterday ’s pr ... more for aries 
it ’s a new day and a new world , or so it seems . yesterday ’s pr ... more for aries 
it ’s a new day and a new world , or so it seems . yesterday ’s pr ... more for aries 
it ’s a new day and a new world , or so it seems . yesterday ’s pr ... more for aries 
it ’s a new day and a new world , or so it seems . yesterday ’s pr ... more for aries 
it ’s a new day and a new world , or so it seems . yesterday ’s pr ... more for aries 
it ’s a new day and a new world , or so it 

In [36]:
printDuplicateTweets([44,3304,4003])

 nypl_tweeter lady gaga ☂ zara larsson
 nypl_tweeter lady gaga ☂ zara larsson
 nypl_tweeter lady gaga ☂ zara larsson


In [37]:
printDuplicateTweets([46,6011])

miss you 
i miss you


In [38]:
printDuplicateTweets([47,789,1790,3814,4136,4138,4358,6170])

 if you knew what my celebration was 
 if you knew what my celebration was 
 if you knew what my celebration was 
 if you knew what my celebration was 
 if you knew what my celebration was 
 if you knew what my celebration was 
 if you knew what my celebration was 
 if you knew what my celebration was 


In [39]:
printDuplicateTweets([48,2748])

 lmao this is amazing still funny 
 lmao this is amazing still funny 


In [40]:
printDuplicateTweets([55,9753])

this year , trump has spent $ 91 , 655 , 424 of american taxpayer money on golf trips . if you are struggling now , to buy present , pay your heat bill , buy medication , or anything else , just remember this one simple fact . trump does not care about you . 
this year , trump has spent $ 91 , 655 , 424 of american taxpayer money on golf trips . if you are struggling now , to buy present , pay your heat bill , buy medication , or anything else , just remember this one simple fact . trump does not care about you . 


In [41]:
printDuplicateTweets([58,2604])

dear twits i wish you all a happy christ 's bihday anniversary , regardless of your culture , class , religious orientation , sexual preference , or race , provided only that you don 't vote next year for the delusional five-year-old currently destroying everything i love about the usa
dear twits i wish you all a happy christ 's bihday anniversary , regardless of your culture , class , religious orientation , sexual preference , or race , provided only that you don 't vote next year for the delusional five-year-old currently destroying everything i love about the usa


In [42]:
printDuplicateTweets([60,9217])

that ’s a big mood 
big mood 


In [43]:
printDuplicateTweets([61,156,651,997,5173,8953])

3 people followed me and one person unfollowed me // automatically checked by 
one person followed me and 3 people unfollowed me // automatically checked by 
10 people followed me and one person unfollowed me // automatically checked by 
2 people followed me and one person unfollowed me // automatically checked by 
2 people followed me and one person unfollowed me // automatically checked by 
one person followed me and 3 people unfollowed me // automatically checked by 


In [44]:
printDuplicateTweets([65,4656])

no more “my bad i didn ’t see your call” in 2018 . i seen it , i ignored it . 
no more “my bad i didn ’t see your call” in 2018 . i seen it , i ignored it . 


In [45]:
printDuplicateTweets([74,75,4208,5177])

dear god , thank you 
dear god , thank you 
dear god , thank you 
dear god , thank you 


In [46]:
printDuplicateTweets([75,4208,5177])

dear god , thank you 
dear god , thank you 
dear god , thank you 


In [47]:
printDuplicateTweets([82,4749])

this is one of my fav looks on yoongi 
this is one of my fav looks on yoongi 


In [48]:
printDuplicateTweets([86,115,301,1003,2171,2199,2912,3616,3648,5024,5049,5998,6577,6936,7458,8506,8958,9087,9625,9800])

an unfinished discussion with a family member may unexpectedly ... more for virgo 
an unfinished discussion with a family member may unexpectedly ... more for virgo 
an unfinished discussion with a family member may unexpectedly ... more for virgo 
an unfinished discussion with a family member may unexpectedly ... more for virgo 
an unfinished discussion with a family member may unexpectedly ... more for virgo 
an unfinished discussion with a family member may unexpectedly ... more for virgo 
an unfinished discussion with a family member may unexpectedly ... more for virgo 
an unfinished discussion with a family member may unexpectedly ... more for virgo 
an unfinished discussion with a family member may unexpectedly ... more for virgo 
an unfinished discussion with a family member may unexpectedly ... more for virgo 
an unfinished discussion with a family member may unexpectedly ... more for virgo 
an unfinished discussion with a family member may unexpectedly ... more for virgo 
an u

In [49]:
printDuplicateTweets([91,6714])

their christmas tree 
their christmas tree 


In [50]:
printDuplicateTweets([98,5315,8282])

 a festival created by prof maulana karenga he was convicted of beating women , after removing their clothes . they were also burned with hot soldering irons , while detergent running hoses were forced down in their throats celebrate celebrate abuse 
 a festival created by prof maulana karenga he was convicted of beating women , after removing their clothes . they were also burned with hot soldering irons , while detergent running hoses were forced down in their throats celebrate celebrate abuse 
 a festival created by prof maulana karenga he was convicted of beating women , after removing their clothes . they were also burned with hot soldering irons , while detergent running hoses were forced down in their throats celebrate celebrate abuse 
