# Lab 7 - Textual Data Analytics
Complete the code marked with TODO tags.
## 1. Feature Engineering
In this exercise we will examine how TF-IDF ranking works. Implement the feature engineering and its application, based on the code framework provided below.

First we use textual data from Twitter.

In [1]:
import numpy as np
import pandas as pd
from scipy import spatial

data = pd.read_csv('elonmusk_tweets.csv')
print(len(data))
data.head()

2819


Unnamed: 0,id,created_at,text
0,849636868052275200,2017-04-05 14:56:29,b'And so the robots spared humanity ... https:...
1,848988730585096192,2017-04-03 20:01:01,"b""@ForIn2020 @waltmossberg @mims @defcon_5 Exa..."
2,848943072423497728,2017-04-03 16:59:35,"b'@waltmossberg @mims @defcon_5 Et tu, Walt?'"
3,848935705057280001,2017-04-03 16:30:19,b'Stormy weather in Shortville ...'
4,848416049573658624,2017-04-02 06:05:23,"b""@DaveLeeBBC @verge Coal is dying due to nat ..."


In [2]:
# Some data cleaning

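# Remove hyperlinks (regex=True is required in recent pandas, where the default is a literal match)
data['text'] = data['text'].str.replace(r'http\S+|www\.\S+', '', case=False, regex=True)

# Remove leading 'b'
data['text'] = data['text'].str.replace(r'^b', '', regex=True)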

# Remove word 'RT'
data['text'] = data['text'].str.replace('RT', '')

data.head()

Unnamed: 0,id,created_at,text
0,849636868052275200,2017-04-05 14:56:29,'And so the robots spared humanity ...
1,848988730585096192,2017-04-03 20:01:01,"""@ForIn2020 @waltmossberg @mims @defcon_5 Exac..."
2,848943072423497728,2017-04-03 16:59:35,"'@waltmossberg @mims @defcon_5 Et tu, Walt?'"
3,848935705057280001,2017-04-03 16:30:19,'Stormy weather in Shortville ...'
4,848416049573658624,2017-04-02 06:05:23,"""@DaveLeeBBC @verge Coal is dying due to nat g..."


### 1.1. Text Normalization
Now we need to normalize text by stemming, tokenizing, and removing stopwords.

In [3]:
from __future__ import print_function, division
from nltk.stem import PorterStemmer, WordNetLemmatizer
import nltk
nltk.download('punkt')
import string
from nltk.corpus import stopwords
import math
from collections import Counter
nltk.download('stopwords')
import pprint 
pp = pprint.PrettyPrinter(indent=4)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\lucas\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\lucas\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [4]:
def normalize(document):
    
    # TODO: Remove Punctuation AND tokenize text
    tokenizer = nltk.RegexpTokenizer(r"\w+")
    text = tokenizer.tokenize(document)

    # TODO: Stemming
    porter = PorterStemmer()
    ret = [porter.stem(w) for w in text]
    
    return ret

original_documents = [x.strip() for x in data['text']]

documents = [normalize(d) for d in original_documents]
print(documents[0])

['and', 'so', 'the', 'robot', 'spare', 'human']


As you can see, the normalization is still not perfect. Feel free to improve on it (OPTIONAL), e.g. https://marcobonzanini.com/2015/03/09/mining-twitter-data-with-python-part-2/
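One possible direction for such an improvement is sketched below using only the standard library. It assumes that lowercasing and stripping URLs, @mentions, literal byte escapes (such as `\xe2`) and HTML entities is desirable for this corpus. The function name `normalize_tweet` and the tiny `STOP_WORDS` set are hypothetical stand-ins for illustration; the notebook itself uses NLTK's English stop-word list and the Porter stemmer.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only;
# the notebook itself relies on NLTK's English list.
STOP_WORDS = {"a", "an", "and", "the", "is", "to", "of", "so", "in"}

def normalize_tweet(text, stop_words=STOP_WORDS):
    """Lowercase a tweet and strip URLs, mentions, byte escapes and stop words."""
    text = text.lower()
    text = re.sub(r"http\S+|www\.\S+", " ", text)   # hyperlinks
    text = re.sub(r"@\w+", " ", text)                # @mentions
    text = re.sub(r"\\x[0-9a-f]{2}", " ", text)      # literal escapes such as \xe2
    text = re.sub(r"&\w+;", " ", text)               # HTML entities such as &amp;
    tokens = re.findall(r"[a-z0-9]+", text)          # tokenize, dropping punctuation
    return [t for t in tokens if t not in stop_words]

sample = r"@verge And so the robots spared humanity ... \xe2\x80\xa6 https://t.co/abc"
print(normalize_tweet(sample))  # ['robots', 'spared', 'humanity']
```

Lowercasing before the stop-word filter matters: a case-sensitive lookup would let capitalized tokens such as `If` or `It` slip through.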

### 1.2. Implement TF-IDF
Now you need to implement TF-IDF, including creating the vocabulary, computing term frequency, and normalizing by tf-idf weights.

In [5]:
import itertools

# Flatten all the documents
flat_list = [word for doc in documents for word in doc]

# TODO: remove stop words from the vocabulary
stop_words = set(stopwords.words('english')) 

words = [w for w in flat_list if w not in stop_words]

# TODO: we take the 500 most common words only
counts = Counter(words)

# Copy our counts into a sorted (descending order) dictionary
vocabulary = {k: v for k, v in sorted(counts.items(), key=lambda item: item[1], reverse=True)}

# Slice the top 500 values
vocabulary = dict(itertools.islice(vocabulary.items(), 500))

# Copy over our words (without values)
vocabulary = [x for x in vocabulary]
vocabulary.sort()

print(vocabulary)

['0', '000', '1', '10', '100', '11', '1st', '2', '3', '30', '4', '40', '5', '6', '60', '7', '8', '9', 'A', 'AI', 'Am', 'ET', 'I', 'If', 'In', 'It', 'LA', 'My', 'NY', 'No', 'Of', 'S', 'T', 'To', 'US', 'We', 'X', 'abort', 'achiev', 'action', 'actual', 'ad', 'advanc', 'agre', 'aim', 'air', 'alien', 'allow', 'almost', 'alreadi', 'also', 'alway', 'amaz', 'amp', 'ani', 'announc', 'anoth', 'anyth', 'appreci', 'around', 'articl', 'ask', 'attempt', 'auto', 'autopilot', 'away', 'awesom', 'babi', 'back', 'bad', 'badastronom', 'base', 'batteri', 'beauti', 'befor', 'believ', 'best', 'better', 'big', 'bit', 'booster', 'break', 'bring', 'btw', 'build', 'burn', 'busi', 'california', 'call', 'camera', 'canaver', 'cape', 'car', 'carbon', 'care', 'case', 'caus', 'center', 'chang', 'charg', 'check', 'china', 'climat', 'close', 'coast', 'come', 'comment', 'compani', 'competit', 'complet', 'complex', 'confirm', 'consum', 'control', 'cool', 'cost', 'could', 'countri', 'cours', 'cover', 'crazi', 'creat', 'cri

## TF (Term Frequency)

Term frequency is the number of times a term appears in a document.


In [6]:
def tf(vocabulary, documents):
    matrix = [0] * len(documents)
    for i, document in enumerate(documents):
        counts = Counter(document)
        matrix[i] = [0] * len(vocabulary)
        for j, term in enumerate(vocabulary):
            matrix[i][j] = counts[term]
    return matrix

# Use a distinct name so the tf() function is not overwritten by its result
tf_matrix = tf(vocabulary, documents)
nonzero = np.array(tf_matrix[1]) > 0
np.array(vocabulary)[nonzero], np.array(tf_matrix[1])[nonzero]

(array(['base', 'exactli', 'tesla', 'x80', 'xa6', 'xe2'], dtype='<U15'),
 array([1, 1, 1, 1, 1, 1]))

## IDF

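$\mathrm{idf}(t) = \log\left(\frac{N}{\mathrm{df}(t)}\right)$, where $N$ is the number of documents and $\mathrm{df}(t)$ is the number of documents that contain the term $t$.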
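As a small worked example of this definition, the sketch below computes the IDF on a made-up four-document corpus. The function name `inverse_document_frequency` and the toy documents are assumptions for illustration, not part of the lab framework; the natural logarithm is used, matching `math.log`.

```python
import math

def inverse_document_frequency(term, documents):
    """Return log(N / df), where df is the number of documents containing term."""
    doc_freq = sum(1 for doc in documents if term in doc)
    if doc_freq == 0:
        raise ValueError(f"term {term!r} does not occur in any document")
    return math.log(len(documents) / doc_freq)

# Toy corpus (assumed): 'tesla' appears in 3 of 4 documents, 'launch' in 1 of 4.
toy_documents = [
    ["tesla", "car"],
    ["tesla", "rocket"],
    ["rocket", "launch"],
    ["tesla"],
]

print(inverse_document_frequency("tesla", toy_documents))   # log(4/3), a low weight
print(inverse_document_frequency("launch", toy_documents))  # log(4/1), a high weight
```

A term that occurs in every document gets an IDF of 0, so it contributes nothing to the document vectors.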

In [7]:
def idf(vocabulary, documents):
    """TODO: compute IDF, storing values in a dictionary"""
    idf = {}
    
    N = len(documents)
    
    
    for vocab_w in vocabulary:
        # Document frequency: the number of documents that contain the word
        times_appeared = sum(1 for doc in documents if vocab_w in doc)

        # Apply our calculation
        idf[vocab_w] = math.log(N / times_appeared)
    
    return idf

idf = idf(vocabulary, documents)
[idf[key] for key in vocabulary[:5]]


# Print some of the IDF scores
for i, k in enumerate(idf):
    print("(", k ,"|", idf[k],")")
    if i > 30:
        break


( 0 | 4.417776966497951 )
( 000 | 5.305080161498854 )
( 1 | 3.478229372459529 )
( 10 | 4.5429401094519575 )
( 100 | 5.053765733217948 )
( 11 | 5.459230841326113 )
( 1st | 4.808643275184963 )
( 2 | 3.754482749087687 )
( 3 | 3.513320692270799 )
( 30 | 4.853095037755796 )
( 4 | 4.447629929647633 )
( 40 | 5.305080161498854 )
( 5 | 3.8836944805676934 )
( 6 | 4.447629929647633 )
( 60 | 4.853095037755796 )
( 7 | 5.053765733217948 )
( 8 | 4.611932980938908 )
( 9 | 3.537418243849859 )
( A | 4.1829373754205506 )
( AI | 4.611932980938908 )
( Am | 4.417776966497951 )
( ET | 4.766083660766167 )
( I | 2.4974001194478026 )
( If | 4.333219578469889 )
( In | 4.766083660766167 )
( It | 3.4897901948606047 )
( LA | 4.611932980938908 )
( My | 4.280575844984466 )
( NY | 5.546242218315742 )
( No | 4.115496094625017 )
( Of | 5.379188133652576 )
( S | 2.838192017213532 )


In [8]:
def vectorize(document, vocabulary, idf):
    vector = [0]*len(vocabulary)
    counts = Counter(document)
    for i,term in enumerate(vocabulary):
        vector[i] = idf[term] * counts[term]
    return vector

document_vectors = [vectorize(s, vocabulary, idf) for s in documents]
np.array(vocabulary)[np.where(np.array(document_vectors[1]) > 0)], np.array(document_vectors[1])[np.where(np.array(document_vectors[1]) > 0)]

(array(['base', 'exactli', 'tesla', 'x80', 'xa6', 'xe2'], dtype='<U15'),
 array([5.54624222, 4.47840159, 2.09481271, 2.67627933, 3.24365713,
        2.62112751]))

### 1.3. Compare the results with the reference implementation of the scikit-learn library

Now we use the scikit-learn library. As you can see, the way we normalize the text affects the result. Feel free to improve on it further (OPTIONAL), e.g. https://stackoverflow.com/questions/36182502/add-stemming-support-to-countvectorizer-sklearn

In [9]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

tfidf = TfidfVectorizer(analyzer='word', ngram_range=(1,1), min_df = 1, stop_words = 'english', max_features=500)

features = tfidf.fit(original_documents)
corpus_tf_idf = tfidf.transform(original_documents) 

sum_words = corpus_tf_idf.sum(axis=0)
words_freq = [(word, sum_words[0, idx]) for word, idx in tfidf.vocabulary_.items()]
print(sorted(words_freq, key = lambda x: x[1], reverse=True)[:5])
print('tesla', corpus_tf_idf[1, features.vocabulary_['tesla']])

[('tesla', 101.90986835463575), ('model', 79.77144565643432), ('spacex', 75.61689890838036), ('yes', 65.62323278091372), ('teslamotors', 64.90757497311142)]
tesla 0.3393247403223787
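The scikit-learn weight for 'tesla' differs from our own value partly because of tokenization and partly because `TfidfVectorizer` weights terms differently by default: it uses a smoothed IDF, $\ln\frac{1+N}{1+\mathrm{df}(t)} + 1$, and then scales every document vector to unit L2 length. The sketch below reproduces those two steps in plain Python on assumed toy numbers; the helper names `smoothed_idf` and `l2_normalize` are illustrative and are not scikit-learn functions.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

def l2_normalize(vector):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)

# Toy numbers (assumed): 4 documents, two terms with document frequencies 3 and 1,
# each occurring once in the document being vectorized.
n_docs = 4
raw_counts = [1, 1]
doc_freqs = [3, 1]

weights = [tf * smoothed_idf(n_docs, df) for tf, df in zip(raw_counts, doc_freqs)]
unit_weights = l2_normalize(weights)

print(weights)       # unnormalized tf-idf weights
print(unit_weights)  # the same vector scaled to length 1
```

Because of the normalization step, every entry of a scikit-learn TF-IDF row lies between 0 and 1, which is why 'tesla' scores about 0.34 there and about 2.09 in our unnormalized vector.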


### 1.4.  Apply TF-IDF for information retrieval
We can use the vector representation of documents to implement an information retrieval system. We test with the query $Q$ = "tesla nasa"
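Retrieval then amounts to scoring every document vector $d$ against the query vector $q$ with the cosine similarity $\cos(q, d) = \frac{q \cdot d}{\lVert q \rVert \, \lVert d \rVert}$ and keeping the $k$ highest scores. The following standard-library sketch illustrates the idea on made-up three-dimensional vectors. The names `cosine` and `top_k` are hypothetical, and treating an all-zero vector as similarity 0.0 is an assumption made here so that documents without any vocabulary term do not produce undefined scores.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def top_k(query_vector, document_vectors, k):
    """Return (index, score) pairs of the k most similar documents, best first."""
    scored = [(i, cosine(query_vector, d)) for i, d in enumerate(document_vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

toy_query = [1.0, 1.0, 0.0]
toy_docs = [
    [2.0, 0.0, 0.0],  # shares one term with the query
    [0.0, 0.0, 3.0],  # shares no term with the query
    [1.0, 1.0, 0.0],  # points in the same direction as the query
    [0.0, 0.0, 0.0],  # contains no vocabulary term at all
]

for index, score in top_k(toy_query, toy_docs, 2):
    print(index, round(score, 3))
```

Document 2 ranks first and document 0 second, since cosine similarity depends on the direction of the vectors and not on their length.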

In [10]:
lst = [0.0, 0.01, 0.02, 0.0, 0.1, 0.0003]
sorted(lst)

[0.0, 0.0, 0.0003, 0.01, 0.02, 0.1]

In [18]:
def cosine_similarity(v1, v2):
    """Compute cosine similarity; return 0.0 if either vector is all zeros."""
    v1 = np.asarray(v1, dtype=float)
    v2 = np.asarray(v2, dtype=float)
    norm_product = np.linalg.norm(v1) * np.linalg.norm(v2)
    if norm_product == 0:
        # A document with no vocabulary terms has a zero vector; returning 0.0
        # avoids NaN scores, which would break the sort below.
        return 0.0
    return float(np.dot(v1, v2) / norm_product)

def search_vec(query, k, vocabulary, stemmer, document_vectors, original_documents):
    q = query.split()
    q = [stemmer.stem(w) for w in q]

    query_vector = vectorize(q, vocabulary, idf)

    # Rank the documents by cosine similarity.
    # The loop variable must not be named k, or it overwrites the top-k parameter.
    scores = []
    for doc_idx, document_vector in enumerate(document_vectors):
        sim = cosine_similarity(query_vector, document_vector)
        scores.append((sim, doc_idx))

    scores = sorted(scores, key=lambda item: item[0], reverse=True)

    print('Top-{0} documents'.format(k))
    for i in range(min(k, len(scores))):
        print(i, original_documents[scores[i][1]])

query = "tesla nasa"
stemmer = PorterStemmer()
search_vec(query, 5, vocabulary, stemmer, document_vectors, original_documents)

We can also use the scikit-learn library to do the retrieval.

In [16]:
new_features = tfidf.transform([query])

cosine_similarities = linear_kernel(new_features, corpus_tf_idf).flatten()
related_docs_indices = cosine_similarities.argsort()[::-1]

topk = 5
print('Top-{0} documents'.format(topk))
for i in range(topk):
    print(i, original_documents[related_docs_indices[i]])

Top-5 documents
0 '@ashwin7002 @NASA @faa @AFPAA We have not ruled that out.'
1 "SpaceX could not do this without NASA. Can't express enough appreciation.
2 '@NASA launched a rocket into the northern lights
3 'Whatever happens today, we could not have done it without @NASA, but errors are ours alone and me most of all.'
4 ' @NASA: Updated @SpaceX #Dragon #ISS rendezvous times: NASA TV coverage begins Sunday at 3:30amET:  Grapple at  ...'
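Here `linear_kernel` computes plain dot products, which coincide with cosine similarities because `TfidfVectorizer` scales each row to unit L2 length by default. The sketch below checks that equivalence on two assumed toy vectors in plain Python; the helper names are illustrative only.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(vector):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(vector, vector))
    return [x / norm for x in vector] if norm else list(vector)

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Toy vectors (assumed for illustration)
u = [3.0, 4.0, 0.0]
v = [1.0, 2.0, 2.0]

kernel_value = dot(l2_normalize(u), l2_normalize(v))
print(kernel_value, cosine(u, v))  # both are 11/15 up to floating-point rounding
```

If the vectorizer were created with `norm=None`, the dot product would no longer be a cosine, and longer documents would tend to score higher.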
