# Introduction to NLP: Assignment 1 

### Description of Assignment 1

This assignment relates to Theme Basic NLP of the introduction to NLP (courses Deskriptiv analytik / Machine learning for descriptive problems), and will focus on basic text processing methods on Python.

The assignment is handed in as a Jupyter notebook containing the code used to solve the problem, output presenting the results, and, most importantly, notes that present the students' conclusions and answer questions posed in the assignment. 

**Assignment steps/Questions:**

Download and extract data package from here [http://dl.turkunlp.org/intro-to-nlp.tar.gz](http://dl.turkunlp.org/intro-to-nlp.tar.gz). Gzipped file intro-to-nlp/english-tweets-sample.jsonl.gz includes 10,000 English tweets downloaded from the Twiter API. The file is compressed and in JSON Lines format ([http://jsonlines.org/](http://jsonlines.org/)), i.e. one json per line.

Note: If processing the whole fiel takes too long, it's okay to read just a subset of the data, for example only 2,000 tweets...

1. Read tweets in Python 
2. Extract the actual text fields from the tweet jsons, discard all metadata at this point. Note that sometimes the text may be truncated to fit the old character limit. In these cases, is it possible to get the full text?
3. Segment each tweet using both UDPipe machine learned model (can be found from the same data package) and a heuristic method. What can you tell about the segmentation performance on tweets when manually inspecting few examples?
4. Count a word frequency list (how many times each word appears and how many unique words there are). Which are the most commmon words appearing in the data? What kind of words these are?
5. Calculate **idf** weight for each word appearing in the data (one tweet = one document), and print top 20 words with lowest and highest idf values. Why **tf** not really matter when processing tweets?
6. Find duplicate or near duplicate tweets (in terms of text field only) in the data using any method you see fit. What kind of techniques you considered using and/or tested, and how many duplicate or near duplicate did you find?

### Import libraries

In [1]:
import json
import gzip
import ufal.udpipe as udpipe
import re
from collections import Counter
import nltk
nltk.download('stopwords') # download the stopwords dataset
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Niklas\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## 1. Read tweets in Python 

Let's start by reading in the tweets from the gzipped jsonline file. 

In [2]:
# Read data from gzip file and decode the json to a list 

#import json
#import gzip

data = []
with gzip.open('english-tweets-sample.jsonl.gz', 'rb')as f:
    for line in f:        
        data.append(json.loads(line))

In [3]:
# Show information of the data
print("Number of documents:", len(data))
print("Data type:", type(data))
print("First item type:", type(data[0]))
print("First item:", data[0])

Number of documents: 10000
Data type: <class 'list'>
First item type: <class 'dict'>
First item: {'created_at': 'Tue Dec 26 14:16:22 +0000 2017', 'id': 945659557480611840, 'id_str': '945659557480611840', 'text': 'Check out my class in #GranblueFantasy! https://t.co/pAvXn8diJr', 'display_text_range': [0, 39], 'source': '<a href="http://granbluefantasy.jp/" rel="nofollow">グランブルー ファンタジー</a>', 'truncated': False, 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 883980236655779840, 'id_str': '883980236655779840', 'name': 'Pc Kwok', 'screen_name': 'jensenpck', 'location': None, 'url': None, 'description': None, 'translator_type': 'none', 'protected': False, 'verified': False, 'followers_count': 0, 'friends_count': 1, 'listed_count': 0, 'favourites_count': 0, 'statuses_count': 42, 'created_at': 'Sun Jul 09 09:24:46 +0000 2017', 'utc_offset': None, 'time_zone': None, '

## 2. Extract the actual text fields from the tweet jsons

**Extract the actual text fields from the tweet jsons, discard all metadata at this point. Note that sometimes the text may be truncated to fit the old character limit. In these cases, is it possible to get the full text?**

Let's start by extracting only the text fields from the tweets. Some of the tweets are truncated and yes it is possible to extract the full text from truncated tweets. 

There are two kind of truncated texts:
1. Original tweets that are truncated. (Value of truncated == True)
2. Retweeted tweets that are truncated. 

The retweeted tweets do not say from the truncated status if they are truncated. From the retweeted_status it is possible to check if the original tweet has been truncated and to extract the full_text tweet.

In [4]:
# Extract the actual text field an ddiscard meta data
documents = []

for d in data:
    # Check if retweeted
    if("retweeted_status" in d):
        # Check if retweet is truncated
        if(d["retweeted_status"]["truncated"] == True):
            documents.append(d["retweeted_status"]["extended_tweet"]["full_text"])    
        else:
            documents.append(d["retweeted_status"]["text"])
    # Check if original tweet is truncated                
    elif(d["truncated"] == True): 
        documents.append(d["extended_tweet"]["full_text"])
    # Default case
    else:
        documents.append(d["text"])

In [5]:
# Show information of the data
print("Number of documents:", len(documents))
print("Documents type:", type(documents))
print("First item type:", type(documents[0]))
print("First item:", documents[0])
print("Second item:", documents[1])
print("Third item:", documents[2])

Number of documents: 10000
Documents type: <class 'list'>
First item type: <class 'str'>
First item: Check out my class in #GranblueFantasy! https://t.co/pAvXn8diJr
Second item: Extending a big Thank You to our Community Partner all over the world! https://t.co/cu7on7g1si
Third item: Blueberry 🍨 https://t.co/2gzHAFWYJY


## 3. Segment each tweet using both UDPipe machine learned model and a heuristic method

**Segment each tweet using both UDPipe machine learned model (can be found from the same data package) and a heuristic method. What can you tell about the segmentation performance on tweets when manually inspecting few examples?**

In [6]:
# Segmentation with UDPipe machine learned model

#import ufal.udpipe as udpipe

model = udpipe.Model.load("en.segmenter.udpipe")
pipeline = udpipe.Pipeline(model,"tokenize","none","none","horizontal")

segmented_documents = []

for d in documents:
    segmented_documents.append(pipeline.process(d))


In [7]:
print(segmented_documents[0])
print(segmented_documents[25])
print(segmented_documents[50])
print(segmented_documents[75])
print(segmented_documents[100])

Check out my class in # GranblueFantasy !
https://t.co/pAvXn8diJr

@geoffasenna
That ’s a pleasure Geoff !
Merry Christmas to you :)

RT @ TilcoPhoto : RT @ CasaMendoza2012 :
Hello friends !!
Pls - Like , follow , share , tweet , comment , re-post , # hashtag , invite your friends & amp ; rock out !!
https://t.co/muH7Djk31J https://t.co/QEUnnXv67o https://t.co/Wu2kYmMXc5 https://t.co/LSGDT9r3sG …
https://t.co/HVlNdJpM6e

Dear god , thank you 🙏 🏾

When Taehyung spend to much time with Jin : https://t.co/x9PUI8eWSV



In [8]:
# Segmentation with a heuristic model

#import re

heuristic_seg_docs = []

for d in documents:
    segmented = re.sub(r'https?:\/\/.*\/\w*', '', d) #remove links
    segmented = re.sub(r'@[^\s]+', '', segmented) #remove twitter handles
    segmented = re.sub(r'(&.+;)', '', segmented) #remove &amp;  &gt;  ...
    segmented = segmented.lower()
    segmented = re.sub(r'([. , ! ? : ; # @ & -]+)', r' \1 ', segmented) # replace 
    segmented = re.sub(r"(’t)", r" \1", segmented) # clitics n’t
    segmented = re.sub(r"('t)", r" \1", segmented) # clitics n't
    segmented = re.sub(r"(’s)", r" \1", segmented) # clitics ’s
    segmented = re.sub(r"('s)", r" \1", segmented) # clitics 's
    segmented = re.sub(r"(’re)", r" \1", segmented) # clitics ’re
    segmented = re.sub(r"('re)", r" \1", segmented) # clitics 're
    segmented = re.sub(r"(’m)", r" \1", segmented) # clitics ’m
    segmented = re.sub(r"('m)", r" \1", segmented) # clitics 'm
    segmented = re.sub(r"(’ve)", r" \1", segmented) # clitics ’ve
    segmented = re.sub(r"('ve)", r" \1", segmented) # clitics ve
    segmented = re.sub(r"(’d)", r" \1", segmented) # clitics ’d
    segmented = re.sub(r"('d)", r" \1", segmented) # clitics 'd
    segmented = re.sub(r"(’ll)", r" \1", segmented) # clitics ’ll
    segmented = re.sub(r"('ll)", r" \1", segmented) # clitics 'll
    segmented = re.sub(r'\s+', ' ',   segmented) # Remove duplicate whitespaces*
    
    heuristic_seg_docs.append(segmented)
    

In [9]:
print(heuristic_seg_docs[0])
print(heuristic_seg_docs[25])
print(heuristic_seg_docs[50])
print(heuristic_seg_docs[75])
print(heuristic_seg_docs[100])

check out my class in # granbluefantasy ! 
 that ’s a pleasure geoff ! merry christmas to you : )
rt rt hello friends !! pls - like , follow , share , tweet , comment , re - post , # hashtag , invite your friends rock out !! … 
dear god , thank you 🙏🏾
when taehyung spend to much time with jin : 


The results off the UDPipe and the heuristic segmentation are quite different. The heuristic method works pretty good in English atleast. You have to take into consideration the language specific rules. The links and twitter handles are easy to remove with a few regexes. If the heuristic method gets slang or badly written english it might do some weird segmentations, ex. i'mma -> " i 'mma ". The heuristic model can target more specific segmentation needs, but also requires a lot of work to do it propperly.

The UDPipe machine learned model works very well, much better than the heuristic method. It takes into account clitics and hyphenated compound words.
The UDPipe can be trained and work with several languages. It seems to be a very good and fast way of segmenting words if you have a trained classifier. 

The both methods seem to have done a good job, but to truly meassure the performance of the two sgementation method you could meassue the amount of words that are correctly segmented.

# 4. Count a word frequency list 

**Count a word frequency list (how many times each word appears and how many unique words there are). Which are the most commmon words appearing in the data? What kind of words these are?**

In [10]:
# Create a word frequency counter

#from collections import Counter

token_counter = Counter()
for doc in segmented_documents: 
    tokenized = pipeline.process(doc)
    tokens = tokenized.split() # after segmenter, we can do whitespace splitting
    token_counter.update(tokens)

In [11]:
# Show data 
print("Vocabulary size:", len(token_counter))
print("Most common tokens:", token_counter.most_common(100))

Vocabulary size: 36371
Most common tokens: [('.', 5515), ('#', 4312), (',', 3755), ('the', 3753), ('to', 3290), ('@', 2961), ('a', 2655), ('and', 2537), ('I', 2274), ('you', 2261), ('of', 2010), ('in', 1792), ('for', 1759), ('is', 1727), ('-', 1538), ('!', 1489), (':', 1476), ('it', 1194), ('on', 1121), ('that', 1025), ('this', 969), ("'s", 942), ('my', 878), ('with', 846), ('"', 801), ('?', 792), ('your', 771), ('be', 756), ('are', 725), ('me', 709), (';', 686), ('...', 654), ('do', 641), ('i', 640), ('’s', 630), ('&', 612), ('all', 566), ('have', 561), ('at', 554), ('The', 551), ("n't", 551), ('amp', 550), ('so', 548), ('not', 538), (')', 533), ('Christmas', 530), ('was', 516), ('(', 513), ('but', 495), ('like', 477), ('n’t', 472), ('from', 448), ('just', 435), ('they', 430), ('as', 424), ('by', 417), ('’', 417), ('one', 387), ('we', 380), ('up', 379), ('will', 373), ('who', 365), ('You', 362), ("'", 362), ('out', 360), ('people', 357), ('love', 354), ('he', 341), ('can', 334), ('if'

The Vocabulary size of the segmented documents is 36371. The ammount is so high because unique links, hashtags and @users twitter uses. 
Before cleaning the tokens, the most common words seem to be punctuation charachters and so called stop words. 

Let's clean the data a bit and then take a closer look. Let's remove some of the stop words and punctuation characters

In [12]:
# Remove stop words and punctuation characters

#import nltk
#nltk.download('stopwords') # download the stopwords dataset
#from nltk.corpus import stopwords

filtered_tokens = []
punctuation_chars = '. .. , : ( ) ! !! ? ?? " = & - ; ... \\ " ” [ ] # @ “ / * % € $ RT amp'.split() # list of punctuation symbols to ignore
for word, count in token_counter.most_common():
    if word.lower() in stopwords.words("english") or word in punctuation_chars:
        continue
    filtered_tokens.append((word, count))

In [13]:
# Show data
print("Vocabulary size:", len(filtered_tokens))
print("Tokens:", filtered_tokens)

Vocabulary size: 35975


After removing stop words, some punctuation characters and some text values used for markin retweets and ampersands. 
The vocabulary size changed to 35975 which is only 396 tokens removed.

Let's see how the vocabulary size changes when we remove all the unique tokens. This will remove some uniqe words, but most importantly it will remove the unique links and twitter users.

In [14]:
# Remove unique tokens
no_unique_tokens = list(filter(lambda x: (x[1]!= 1) , filtered_tokens)) 
# Show data
print("Vocabulary size:", len(no_unique_tokens))
print("Tokens:", no_unique_tokens)

Vocabulary size: 10621


After removing the unique tokens the vocabulary size dropped to 10261 which is drastically lower than 35975. This should be a more accuarate number of unique english words. There are still some emojis, users, links, slang, etc in the list. 

The some of most common words are the: 
* 's clitic 
* n't clitic
* Christmas
* like
* one
* people
* love
* 2017
* get
* 1
* year
* time
* 2018
* new
* day
* know

If we only look at the words we have:
* Christmas
* like
* one
* people
* love
* get
* year
* time
* new
* day
* know

**The words are mostly Nouns** with a few verbs and adjectives in the mix.


In [15]:
# The 30 most common words 
no_unique_tokens[:30]

[("'s", 942),
 ('’s', 630),
 ("n't", 551),
 ('Christmas', 530),
 ('like', 477),
 ('n’t', 472),
 ('’', 417),
 ('one', 387),
 ("'", 362),
 ('people', 357),
 ('love', 354),
 ('2017', 308),
 ('get', 301),
 ('1', 292),
 ('year', 278),
 ('time', 267),
 ('2018', 265),
 ('new', 248),
 ('day', 240),
 ('know', 237),
 ('today', 229),
 ('2', 217),
 ('see', 213),
 ('family', 213),
 ('want', 211),
 ('good', 208),
 ('got', 206),
 ('back', 203),
 ("'m", 195),
 ('life', 194)]

## 5. Calculate **idf** weight for each word appearing in the data 
**Calculate idf weight for each word appearing in the data (one tweet = one document), and print top 20 words with lowest and highest idf values. Why tf not really matter when processing tweets?**

Let's start by creating the TF and IDF functions


In [16]:
def calculateTF(dictionary, words):
    tf = {}
    dCount = len(words)
    
    
    for word, count in dictionary.items():
        tf[word] = count/float(dCount)
        
    return tf



In [17]:
# Calculate TF for all
wordDictionary = dict(filtered_tokens)
wordDictionary.update({key : 0 for key in wordDictionary.keys()})

tfDocuments = {}
key = 0

for doc in segmented_documents:
    
    docWordDictionary = wordDictionary
    
    bagOfWords = doc.split(" ")
    for word in bagOfWords:
        
        try:
            docWordDictionary[word] += 1
        except:
            pass
            
    tf = calculateTF(docWordDictionary, bagOfWords )
    tfDocuments[key] = tf
    key += 1




In [18]:
def calculateIDF(documents):
    import math
    n = len(documents)
    
    idf = dict(filtered_tokens)
    idf.update({key : 0 for key in idf.keys()})
    
    for doc in documents:
        
        uniqueWords = dict.fromkeys(doc.split(" "),0)
        
        for word in uniqueWords:
            try:
                idf[word] += 1
            except:
                pass


    for word, count in idf.items():
        idf[word] = math.log(n/float(count))
        break
    #print(idf["People"])
    #print(idf)
    return idf
    
        
idfValues = calculateIDF(segmented_documents)    

In [None]:
def calculateTFIDF(tf, idf):
    tfidf = {}
    key = 0
    
    for key, tfDict in tf.items():
        #print(tfDict)
        documentDict = {}
        #print(type(tfDict))
        for word, value in tfDict.items():
            
            documentDict[word] = value *  idf[word]
            #print(documentDict[word], word)
        
        #print(documentDict["Check"])    
        tfidf[key] = documentDict
        key += 1
        
    return tfidf
        
            
tfidfDocuments = calculateTFIDF(tfDocuments, idfValues)

In [336]:
print(segmented_documents[0])
print(idfValues["Check"])
print(tfDocuments[0]["Check"])

Check out my class in # GranblueFantasy !
https://t.co/pAvXn8diJr

22
0.125


In [19]:

key = 0
    
for key, tfDict in tfDocuments.items():
    #print(tfDict)
    documentDict = {}
    #print(type(tfDict))
    for word, value in tfDict.items():
        documentDict[word] = value *  idfValues[word]
        #print(documentDict[word], word)
        
#print(documentDict["Check"])    
tfDocuments[key] = documentDict
key += 1
        

In [21]:
tfDocuments[0]["Check"]

0.125