# Basic NLP
  * Reading and processing language data
  * Segmenting text
  * Calculating word frequencies and idf weights

* Exercises are based on tweets downloaded using Twitter API. Both Finnish and English tweets are available, you are free to choose which language you want to work with.

> Finnish: http://dl.turkunlp.org/intro-to-nlp/finnish-tweets-sample.jsonl.gz

> English: http://dl.turkunlp.org/intro-to-nlp/english-tweets-sample.jsonl.gz

* Both files include 10,000 tweets.


## 1) Read tweets in Python

* Download the file, and read the data in Python
* **The outcome** is a list of tweets, where each tweet is a dictionary including different (key, value) pairs

In [44]:
# import packages
import gzip
import urllib.request
import json

# download and read data
data = urllib.request.urlopen("http://dl.turkunlp.org/intro-to-nlp/english-tweets-sample.jsonl.gz")
with gzip.open(data, 'rb') as f:
  res = [json.loads(jline) for jline in f.read().splitlines()]

## 2) Extract texts from the tweet jsons

* Extract the actual text field for each tweet.
* **The outcome** is a list of tweets, where each tweet is a string.

In [46]:
tweets = []
# Extract text from each line
for i in res:
  tweets.append(i["text"])


## 3) Segment tweets

* Segment tweets using the UDPipe machine learned model.

> English model: https://github.com/TurkuNLP/intro-to-nlp/raw/master/Data/en.segmenter.udpipe

> Finnish model: https://github.com/TurkuNLP/intro-to-nlp/raw/master/Data/fi.segmenter.udpipe


* **The output** is a list of segmented tweets, where each tweet is a string.

In [None]:
# Import packages
!wget -nc https://github.com/TurkuNLP/intro-to-nlp/raw/master/Data/en.segmenter.udpipe
!pip3 install ufal.udpipe
import ufal.udpipe as udpipe

# Define model and pipeline
model = udpipe.Model.load("en.segmenter.udpipe")
pipeline = udpipe.Pipeline(model, "tokenize", "none", "none", "horizontal")

# Apply model to each line and append to a list
seg_doc = []
for i in range(0,len(tweets)):
  seg_doc.append(pipeline.process(tweets[i]))

# Print to check output
print(seg_doc)

## 4) Calculate word frequencies

* Calculate a word frequency list (how many times each word appears) based on the tweets. 
* Calculate the size of the vocabulary (how many unique words there are).
* **The output** is a sorted list of X most common words and their frequencies, and the number of unique words in the data.

In [48]:
from collections import Counter
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

# Initiate counter
token_counter = Counter()
# Loop through each line and split words, adding to the counter as we go
for i in range(0,len(seg_doc)):
  tokens = seg_doc[i].split()
  token_counter.update(tokens)

#filtered = []
#punct = '. , : ( ) ! ? " = & - ; ... \\ '.split() 
#for word, count in token_counter.most_common(100):
#  if word.lower() in stopwords.words("english") or word in punct:
#    continue
#  filtered.append((word, count))


print("Most common tokens:", token_counter.most_common(20))
print("Unique words:", len(token_counter))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
Most common tokens: [('@', 7150), (':', 6589), ('RT', 5744), ('.', 3518), ('#', 2897), ('the', 2870), (',', 2863), ('…', 2614), ('to', 2538), ('a', 2221), ('I', 1984), ('and', 1853), ('you', 1786), ('of', 1486), ('for', 1439), ('is', 1415), ('in', 1370), ('-', 1171), ('!', 1161), ('on', 923)]
Unique words: 35180


## 5) Calculate idf weights

* Calculate idf weight for each word appearing in the data (one tweet = one document), and print top 20 words with lowest and highest idf values.
* Can you think of a reason why someone could claim that tf does not have a high impact when processing tweets?
* **The output of this excercise** should be a list of words sorted by their idf weights.


In [49]:
import math
# Number of tweets
m = len(seg_doc)
# New counter
# In how many tweets each word is present
df_counter = Counter()

# Loop through each line, split by word and use set to only get unique words per line
for i in range(0, len(seg_doc)):
  occurrences = set(seg_doc[i].split())
  df_counter.update(occurrences)
# Apply the df -> idf calculation to each word
for word, value in df_counter.items():
  df_counter[word] = math.log10(m/float(value))

# Highest and lowest idf values
print("Highest idf:", df_counter.most_common(20))
print("Lowest idf:", df_counter.most_common()[-21:-1])

# TF might not have a big impact on its own when processing tweets is because it often lacks
# context. If for example someone is doing a sentiment analysis about something, TF could make a term appear popular
# even though most of the tweets could be critizising it. TF also doesn't take into account synonyms.

Highest idf: [('https://t.co/pAvXn8diJr', 4.0), ('Partner', 4.0), ('Extending', 4.0), ('https://t.co/cu7on7g1si', 4.0), ('Blueberry', 4.0), ('🍨', 4.0), ('https://t.co/2gzHAFWYJY', 4.0), ('Chim', 4.0), ('@prologve_', 4.0), ('@BTS_ARMY', 4.0), ('cuddle', 4.0), ('Inn', 4.0), ('https://t.co/lXdFUm4qUb', 4.0), ('countryinns', 4.0), ('CampSprings', 4.0), ('PENELOPE', 4.0), ('https://t.co/1z1cgzvZxh', 4.0), ('CBCNews', 4.0), ('https://t.co/6R0nw7xDlL', 4.0), ('coldest', 4.0)]
Lowest idf: [('it', 1.090979145788844), ('-', 1.0888423912600234), ('on', 1.0665127121512945), ('!', 1.0065637695023884), ('in', 0.9118639112994488), ('is', 0.8992849134269184), ('of', 0.8794260687941502), ('for', 0.8738685927380156), ('you', 0.8520146793161949), ('I', 0.8193007987039653), ('#', 0.8181564120552274), ('and', 0.7804154737857453), ('a', 0.7228493860362033), (',', 0.6880343396316336), ('to', 0.6712127996454653), ('the', 0.6476245049994801), ('…', 0.5848596478041272), ('.', 0.5816987086802545), ('RT', 0.24764

## 6) Duplicates or near duplicates

* Check whether we have duplicate tweets (in terms of text field only) in our dataset. Duplicate tweet means here that the exactly same tweet text appears more than once in our dataset.
* Note: It makes sense to check the duplicates using original tweet texts as the texts were before segmentation. I would also recommend using the full 10,000 dataset here in order to get higher chance of seeing duplicates (this does not require heavy computing).
* Try to check whether tweets have additional near-duplicates. Near duplicate means here that tweet text is almost the same in two or more tweets. Ponder what kind of near duplicates there could be and how to find those. Start by considering for example different normalization techniques. Implement some of the techniques you considered.
* **The outcome of this exercise** should be a number of unique tweets in our dataset (with possibly counting also which are the most common duplicates) as well as the number of unique tweets after removing also near duplicates.

In [50]:
# Import the stemmer
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")

# Counter for each task
unique_counter = Counter()
duplicate_counter = Counter()
true_unique = Counter()

# Use set to find only unique tweets
unique = set(tweets)
unique_counter.update(unique)
# Find most common full tweets, which means most common full duplicates
duplicate_counter.update(tweets)

# Loop through each tweet, split into words, steam each word, return to a new list and keep the original
# tweet stucture. stemmed = the original tweet with each word stemmed
stemmed = [" ".join([stemmer.stem(word) for word in sentence.split(" ")]) for sentence in tweets]
# Update counter with the stemmed tweets
true_unique.update(stemmed)

print("Number of unique tweets:", len(unique_counter))
print("Most common duplicates:", duplicate_counter.most_common(5))
print("Number of unique tweets after removing near duplicates:", len(true_unique))

Number of unique tweets: 9017
Most common duplicates: [('RT @SlushiiMusic: MIC DROP @BTS_twt https://t.co/5p1CArQuaO https://t.co/GnlvhJoetb', 30), ('RT @Louis_Tomlinson: Thank you so much for all the birthday messages and I hope everyone had a great Christmas ! Loads of love', 22), ('RT @lebaenesepapii: y’all could’ve just said that a transgender couple have a baby rather than giving me brain damage https://t.co/uVO2jEXL…', 15), ('RT @dril: my friend the only crypto currency you wanna get your hands on is this: bird seed. There is a lot of birds and they all gotta eat', 14), ('RT @GMA: SO excited for @BTS_twt to perform on @NYRE right here on ABC! #RockinEve https://t.co/QN5A3waARg', 14)]
Number of unique tweets after removing near duplicates: 9016
