# Introduction to NLP: Assignment 1 

### Description of Assignment 1

This assignment relates to Theme Basic NLP of the introduction to NLP (courses Deskriptiv analytik / Machine learning for descriptive problems), and will focus on basic text processing methods on Python.

The assignment is handed in as a Jupyter notebook containing the code used to solve the problem, output presenting the results, and, most importantly, notes that present the students' conclusions and answer questions posed in the assignment. 

**Assignment steps/Questions:**

Download and extract data package from here [http://dl.turkunlp.org/intro-to-nlp.tar.gz](http://dl.turkunlp.org/intro-to-nlp.tar.gz). Gzipped file intro-to-nlp/english-tweets-sample.jsonl.gz includes 10,000 English tweets downloaded from the Twiter API. The file is compressed and in JSON Lines format ([http://jsonlines.org/](http://jsonlines.org/)), i.e. one json per line.

Note: If processing the whole fiel takes too long, it's okay to read just a subset of the data, for example only 2,000 tweets...

1. Read tweets in Python 
2. Extract the actual text fields from the tweet jsons, discard all metadata at this point. Note that sometimes the text may be truncated to fit the old character limit. In these cases, is it possible to get the full text?
3. Segment each tweet using both UDPipe machine learned model (can be found from the same data package) and a heuristic method. What can you tell about the segmentation performance on tweets when manually inspecting few examples?
4. Count a word frequency list (how many times each word appears and how many unique words there are). Which are the most commmon words appearing in the data? What kind of words these are?
5. Calculate **idf** weight for each word appearing in the data (one tweet = one document), and print top 20 words with lowest and highest idf values. Why **tf** not really matter when processing tweets?
6. Find duplicate or near duplicate tweets (in terms of text field only) in the data using any method you see fit. What kind of techniques you considered using and/or tested, and how many duplicate or near duplicate did you find?

### Import libraries

In [2]:
import json
import gzip
import ufal.udpipe as udpipe
import re
import pandas as pd
from collections import Counter
import nltk
nltk.download('stopwords') # download the stopwords dataset
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Niklas\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## 1. Read tweets in Python 

Let's start by reading in the tweets from the gzipped jsonline file. 

In [3]:
# Read data from gzip file and decode the json to a list 

#import json
#import gzip

data = []
with gzip.open('english-tweets-sample.jsonl.gz', 'rb')as f:
    for line in f:        
        data.append(json.loads(line))

In [4]:
# Show information of the data
print("Number of documents:", len(data))
print("Data type:", type(data))
print("First item type:", type(data[0]))
print("First item:", data[0])

Number of documents: 10000
Data type: <class 'list'>
First item type: <class 'dict'>
First item: {'created_at': 'Tue Dec 26 14:16:22 +0000 2017', 'id': 945659557480611840, 'id_str': '945659557480611840', 'text': 'Check out my class in #GranblueFantasy! https://t.co/pAvXn8diJr', 'display_text_range': [0, 39], 'source': '<a href="http://granbluefantasy.jp/" rel="nofollow">グランブルー ファンタジー</a>', 'truncated': False, 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 883980236655779840, 'id_str': '883980236655779840', 'name': 'Pc Kwok', 'screen_name': 'jensenpck', 'location': None, 'url': None, 'description': None, 'translator_type': 'none', 'protected': False, 'verified': False, 'followers_count': 0, 'friends_count': 1, 'listed_count': 0, 'favourites_count': 0, 'statuses_count': 42, 'created_at': 'Sun Jul 09 09:24:46 +0000 2017', 'utc_offset': None, 'time_zone': None, '

## 2. Extract the actual text fields from the tweet jsons

**Extract the actual text fields from the tweet jsons, discard all metadata at this point. Note that sometimes the text may be truncated to fit the old character limit. In these cases, is it possible to get the full text?**

Let's start by extracting only the text fields from the tweets. Some of the tweets are truncated and yes it is possible to extract the full text from truncated tweets. 

There are two kind of truncated texts:
1. Original tweets that are truncated. (Value of truncated == True)
2. Retweeted tweets that are truncated. 

The retweeted tweets do not say from the truncated status if they are truncated. From the retweeted_status it is possible to check if the original tweet has been truncated and to extract the full_text tweet.

In [8]:
# Extract the actual text field an ddiscard meta data
documents = []

for d in data:
    # Check if retweeted
    if("retweeted_status" in d):
        # Check if retweet is truncated
        if(d["retweeted_status"]["truncated"] == True):
            documents.append(d["retweeted_status"]["extended_tweet"]["full_text"])    
        else:
            documents.append(d["retweeted_status"]["text"])
    # Check if original tweet is truncated                
    elif(d["truncated"] == True): 
        documents.append(d["extended_tweet"]["full_text"])
    # Default case
    else:
        documents.append(d["text"])

In [9]:
# Show information of the data
print("Number of documents:", len(documents))
print("Documents type:", type(documents))
print("First item type:", type(documents[0]))
print("First item:", documents[0])
print("Second item:", documents[1])
print("Third item:", documents[2])

Number of documents: 10000
Documents type: <class 'list'>
First item type: <class 'str'>
First item: Check out my class in #GranblueFantasy! https://t.co/pAvXn8diJr
Second item: Extending a big Thank You to our Community Partner all over the world! https://t.co/cu7on7g1si
Third item: Blueberry 🍨 https://t.co/2gzHAFWYJY


## 3. Segment each tweet using both UDPipe machine learned model and a heuristic method

**Segment each tweet using both UDPipe machine learned model (can be found from the same data package) and a heuristic method. What can you tell about the segmentation performance on tweets when manually inspecting few examples?**

In [10]:
# Segmentation with UDPipe machine learned model

#import ufal.udpipe as udpipe

model = udpipe.Model.load("en.segmenter.udpipe")
pipeline = udpipe.Pipeline(model,"tokenize","none","none","horizontal")

segmented_documents = []

for d in documents:
    segmented_documents.append(pipeline.process(d))


In [62]:
# Show data
df = pd.DataFrame({'Tweets':segmented_documents})
df.head()

Unnamed: 0,Tweets
0,Check out my class in # GranblueFantasy !\nhtt...
1,Extending a big Thank\nYou to our Community Pa...
2,Blueberry 🍨 https://t.co/2gzHAFWYJY\n
3,Bad day ☹️®️\n
4,@prologve_ @BTS_ARMY @BTS_twt I 'm Chim tho\n


In [115]:
# Segmentation with a heuristic model

#import re

heuristic_seg_docs = []

for d in documents:
    segmented = re.sub(r'https?:\/\/.*\/\w*', '', d) #remove links
    segmented = re.sub(r'@[^\s]+', '', segmented) #remove twitter handles
    segmented = re.sub(r'#[^\s]+', '', segmented) #remove hashtags
    segmented = re.sub(r'(&.+;)', '', segmented) #remove &amp;  &gt;  ...
    segmented = re.sub(r'[\U00010000-\U0010ffff]', '', segmented) #remove some emojis
    segmented = segmented.lower()
    segmented = re.sub(r'rt', '', segmented) #remove retweets
    segmented = re.sub(r'([. , ! ? : ; # @ & - € $]+)', r' \1 ', segmented) # replace 
    segmented = re.sub(r"(’t)", r" \1", segmented) # clitics n’t
    segmented = re.sub(r"('t)", r" \1", segmented) # clitics n't
    segmented = re.sub(r"(’s)", r" \1", segmented) # clitics ’s
    segmented = re.sub(r"('s)", r" \1", segmented) # clitics 's
    segmented = re.sub(r"(’re)", r" \1", segmented) # clitics ’re
    segmented = re.sub(r"('re)", r" \1", segmented) # clitics 're
    segmented = re.sub(r"(’m)", r" \1", segmented) # clitics ’m
    segmented = re.sub(r"('m)", r" \1", segmented) # clitics 'm
    segmented = re.sub(r"(’ve)", r" \1", segmented) # clitics ’ve
    segmented = re.sub(r"('ve)", r" \1", segmented) # clitics ve
    segmented = re.sub(r"(’d)", r" \1", segmented) # clitics ’d
    segmented = re.sub(r"('d)", r" \1", segmented) # clitics 'd
    segmented = re.sub(r"(’ll)", r" \1", segmented) # clitics ’ll
    segmented = re.sub(r"('ll)", r" \1", segmented) # clitics 'll
    segmented = re.sub(r'\s+', ' ',   segmented) # Remove duplicate whitespaces*
    
    heuristic_seg_docs.append(segmented)
    

In [160]:
# Show data
df = pd.DataFrame({'Tweets':heuristic_seg_docs})
df.head()

Unnamed: 0,Tweets
0,check out my class in
1,extending a big thank you to our community pan...
2,blueberry
3,bad day ☹️®️
4,i 'm chim tho


The results off the UDPipe and the heuristic segmentation are quite different. The heuristic method works pretty good in English atleast. You have to take into consideration the language specific rules. The links and twitter handles are easy to remove with a few regexes. If the heuristic method gets slang or badly written english it might do some weird segmentations, ex. i'mma -> " i 'mma ". The heuristic model can target more specific segmentation needs, but also requires a lot of work to do it propperly. I prefer to this example the heuristic method because of the ability to easily remove all the hashtags, twitter handles, links and other trash in the text.

Comparing the first tweets shows that the links and hashtags have dissapeared in the heuristic method. Other wise they have pretty similar results.

The UDPipe machine learned model works very well. It takes into account clitics and hyphenated compound words.
The UDPipe can be trained and work with several languages. It seems to be a very good and fast way of segmenting words if you have a trained classifier. This seems to be the optimal way of segmenting texts if you want to do it fast and well. 

The both methods seem to have done a good job, but to truly meassure the performance of the two sgementation method you could meassue the amount of words that are correctly segmented. I will use the heurisitc method in the following calculations. 

# 4. Count a word frequency list 

**Count a word frequency list (how many times each word appears and how many unique words there are). Which are the most commmon words appearing in the data? What kind of words these are?**

In [117]:
# Create a word frequency counter

#from collections import Counter

token_counter = Counter()

for doc in heuristic_seg_docs: 
    tokenized = pipeline.process(doc)
    tokens = tokenized.split() # after segmenter, we can do whitespace splitting
    token_counter.update(tokens)

In [118]:
# Show data
print("Vocabulary size:", len(token_counter))
df = pd.DataFrame.from_dict(token_counter, orient='index').reset_index()
df = df.rename(columns={'index':'Word', 0:'Count'})
df.sort_values('Count', ascending=False)[0:20]

Vocabulary size: 17187


Unnamed: 0,Word,Count
69,.,5929
16,the,4345
92,",",3714
10,to,3445
6,a,2940
23,i,2871
78,and,2705
9,you,2663
63,of,2069
4,in,1945


The Vocabulary size of the segmented documents is 17187. 
Before cleaning the tokens, the most common words seem to be punctuation charachters and so called stop words. 

Let's clean the data a bit and then take a closer look. Let's remove some of the stop words and punctuation characters

In [111]:
# Remove stop words and punctuation characters

#import nltk
#nltk.download('stopwords') # download the stopwords dataset
#from nltk.corpus import stopwords

filtered_tokens = []
punctuation_chars = '. .. , : ( ) ! !! ? ?? " = & - ; ... \\ " ” [ ] # @ “ / * % € $ '.split() # list of punctuation symbols to ignore
for word, count in token_counter.most_common():
    if word.lower() in stopwords.words("english") or word in punctuation_chars:
        continue
    filtered_tokens.append((word, count))

In [127]:
# Show data
print("Vocabulary size:", len(filtered_tokens))
df = pd.DataFrame(filtered_tokens)
df = df.rename(columns={0:'Word', 1:'Count'})
df.sort_values('Count', ascending=False)[0:20]

Vocabulary size: 17014


Unnamed: 0,Word,Count
0,'s,929
1,’s,642
2,christmas,579
3,'t,569
4,like,532
5,’t,474
6,one,456
7,’,447
8,love,438
9,people,396


After removing stop words, some punctuation the vocabulary size changed to 17014 which is only 173 tokens removed.

The vocabulary still has some emoticons/emojis/dingbats/symbols etc. We will ignore those for now

# TODO unique words and how many times each word appears

The some of most common words are the: 
* 's clitic 
* Christmas
* 't clitic
* like
* one
* love
* people
* new
* year
* get
* day 
* 2017
* good
* time
* today

If we only look at the words, not numbers or clitics then we have:
* Christmas
* like
* one
* love
* people
* new
* year
* get
* day 
* good
* time
* today

**The words are mostly Nouns** with a few verbs and adjectives in the mix.


In [130]:
# The 30 most common words 
df.sort_values('Count', ascending=False)[0:30]

Unnamed: 0,Word,Count
0,'s,929
1,’s,642
2,christmas,579
3,'t,569
4,like,532
5,’t,474
6,one,456
7,’,447
8,love,438
9,people,396


## 5. Calculate **idf** weight for each word appearing in the data 
**Calculate idf weight for each word appearing in the data (one tweet = one document), and print top 20 words with lowest and highest idf values. Why tf not really matter when processing tweets?**

Let's start by creating the TF and IDF functions


In [197]:
# Calculate TF for all
wordDictionary = dict(filtered_tokens)
wordDictionary.update({key : 0 for key in wordDictionary.keys()})

tfDocuments = {}
key = 0

for doc in heuristic_seg_docs:
    
    docWordDictionary = {}

    bagOfWords = doc.split(" ")
    for word in bagOfWords:
        if word in docWordDictionary:
            docWordDictionary[word] += 1
        else:
            docWordDictionary[word] = 1

    tfDocuments[key] = docWordDictionary
    key += 1

In [249]:
def calculateIDF(documents):
    import math
    n = len(documents)
    
    idf = dict(filtered_tokens)
    idf.update({key : 0 for key in idf.keys()})
    
    for key, value in documents.items():
        for word in value:
            try:
                idf[word] += 1
            except:
                pass
        
        #for word in doc:
         #   idf[word] += 1

    for word, count in idf.items():
        try:
            idf[word] = math.log(n/float(count))
        except:
            pass
        #print(float(count))
        
    
    return idf


idfValues = calculateIDF(tfDocuments)    

In [254]:
df = pd.DataFrame.from_dict(idfValues, orient='index').reset_index()
df = df.rename(columns={'index':'Word', 0:'idf'})
df.sort_values('idf', ascending=False)[0:20]

Unnamed: 0,Word,idf
8507,kafirs,9.21034
11122,727,9.21034
11106,militaryoffice,9.21034
11107,photocard,9.21034
11108,ouch,9.21034
11109,richmond,9.21034
11110,bumble,9.21034
11111,speedruns,9.21034
11112,opengl,9.21034
11113,tutorials,9.21034


In [255]:
df = pd.DataFrame.from_dict(idfValues, orient='index').reset_index()
df = df.rename(columns={'index':'Word', 0:'idf'})
df.sort_values('idf', ascending=True)[0:20]

Unnamed: 0,Word,idf
12279,xxxtentacion,0.0
15606,/it,0.0
8644,adley,0.0
15621,-open,0.0
8626,29-12,0.0
15634,nu,0.0
8619,crippled,0.0
15641,+1,0.0
15642,sathi,0.0
15643,-twe,0.0


In [256]:
def calculateIDF(documents):
    import math
    n = len(documents)
    
    idf = {}
    
    for key, value in documents.items():
        for word in value:
            if word in idf:
                idf[word] += 1
            else:
                idf[word] = 1


    for word, count in idf.items():
            idf[word] = math.log(n/float(count))

    
    return idf


idfValues = calculateIDF(tfDocuments)    

In [257]:
df = pd.DataFrame.from_dict(idfValues, orient='index').reset_index()
df = df.rename(columns={'index':'Word', 0:'idf'})
df.sort_values('idf', ascending=False)[0:20]

Unnamed: 0,Word,idf
19112,fridge,9.21034
9551,it‘s,9.21034
9539,tan,9.21034
9540,bulldog,9.21034
9541,murali,9.21034
9542,sharmawe,9.21034
9543,costar,9.21034
9544,priory,9.21034
9547,endure,9.21034
9548,maknaes,9.21034


In [258]:
df = pd.DataFrame.from_dict(idfValues, orient='index').reset_index()
df = df.rename(columns={'index':'Word', 0:'idf'})
df.sort_values('idf', ascending=True)[0:20]

Unnamed: 0,Word,idf
5,,0.14387
66,.,1.090644
17,the,1.241329
11,to,1.363359
7,a,1.470981
89,",",1.474907
75,and,1.563032
24,i,1.639897
10,you,1.691733
60,of,1.799993


In [None]:
# Create IDF calculation method
def calculateIDF(documents):
    import math
    n = len(documents)
    
    idf = dict(filtered_tokens)
    idf.update({key : 0 for key in idf.keys()})
    
    for doc in documents:
        
        uniqueWords = dict.fromkeys(doc.split(" "),0)
        
        for word in uniqueWords:
            try:
                idf[word] += 1
            except:
                pass


    for word, count in idf.items():
        idf[word] = math.log(n/float(count))
        print(float(count))
        
    
    return idf


idfValues = calculateIDF(heuristic_seg_docs)    