# Lab 7 - Textual Data Analytics
Complete the code sections marked with a TODO tag.
## 1. Feature Engineering
In this exercise we will study how TF-IDF ranking works. Implement the feature engineering and its application, based on the code framework provided below.

First we use textual data from Twitter.

In [24]:
import numpy as np
import pandas as pd
from scipy import spatial

data = pd.read_csv('elonmusk_tweets.csv')
print(len(data))
data.head()

2819


Unnamed: 0,id,created_at,text
0,849636868052275200,2017-04-05 14:56:29,b'And so the robots spared humanity ... https:...
1,848988730585096192,2017-04-03 20:01:01,"b""@ForIn2020 @waltmossberg @mims @defcon_5 Exa..."
2,848943072423497728,2017-04-03 16:59:35,"b'@waltmossberg @mims @defcon_5 Et tu, Walt?'"
3,848935705057280001,2017-04-03 16:30:19,b'Stormy weather in Shortville ...'
4,848416049573658624,2017-04-02 06:05:23,"b""@DaveLeeBBC @verge Coal is dying due to nat ..."


In [25]:
# Some data cleaning

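# Remove hyperlinks (raw string, with regex=True so the pattern is treated as a regular expression)
data['text'] = data['text'].str.replace(r'http\S+|www\.\S+', '', case=False, regex=True)

# Remove leading 'b' (left over from the bytes literal in the CSV)
data['text'] = data['text'].str.replace(r'^b', '', regex=True)

# Remove word 'RT'
data['text'] = data['text'].str.replace('RT', '', regex=False)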

data.head()

Unnamed: 0,id,created_at,text
0,849636868052275200,2017-04-05 14:56:29,'And so the robots spared humanity ...
1,848988730585096192,2017-04-03 20:01:01,"""@ForIn2020 @waltmossberg @mims @defcon_5 Exac..."
2,848943072423497728,2017-04-03 16:59:35,"'@waltmossberg @mims @defcon_5 Et tu, Walt?'"
3,848935705057280001,2017-04-03 16:30:19,'Stormy weather in Shortville ...'
4,848416049573658624,2017-04-02 06:05:23,"""@DaveLeeBBC @verge Coal is dying due to nat g..."


### 1.1. Text Normalization
Now we need to normalize text by stemming, tokenizing, and removing stopwords.

In [26]:
from __future__ import print_function, division
from nltk.stem import PorterStemmer, WordNetLemmatizer
import nltk
nltk.download('punkt')
import string
from nltk.corpus import stopwords
import math
from collections import Counter
nltk.download('stopwords')
import pprint 
pp = pprint.PrettyPrinter(indent=4)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\lucas\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\lucas\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [27]:
def normalize(document):
    
    # TODO: Remove Punctuation AND tokenize text
    tokenizer = nltk.RegexpTokenizer(r"\w+")
    text = tokenizer.tokenize(document)

    # TODO: Stemming
    porter = PorterStemmer()
    ret = [porter.stem(w) for w in text]
    
    return ret

original_documents = [x.strip() for x in data['text']]

documents = [normalize(d) for d in original_documents]
print(documents[0])

['and', 'so', 'the', 'robot', 'spare', 'human']


As you can see, the normalization is still not perfect. Feel free to improve it (OPTIONAL); see, e.g., https://marcobonzanini.com/2015/03/09/mining-twitter-data-with-python-part-2/
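One possible direction for such an improvement is a tweet-aware tokenizer. The sketch below is only an illustration using the standard library: the pattern `TOKEN_PATTERN` and the helper `tokenize_tweet` are hypothetical names, and the choices to lowercase, drop @mentions, and drop purely numeric tokens are assumptions rather than requirements of the lab.

```python
import re

# Hypothetical pattern: keeps @mentions and #hashtags whole, otherwise matches runs of word characters
TOKEN_PATTERN = re.compile(r"[@#]\w+|\w+")

def tokenize_tweet(text, drop_mentions=True):
    """Lowercase a tweet and split it into tokens, optionally dropping @mentions."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    if drop_mentions:
        tokens = [t for t in tokens if not t.startswith("@")]
    # Drop purely numeric tokens such as '2017'
    return [t for t in tokens if not t.isdigit()]

print(tokenize_tweet("@waltmossberg Et tu, Walt? #AI in 2017"))
```

Lowercasing before stop-word removal also matters here, because the NLTK English stop-word list is lowercase and would otherwise miss tokens such as 'I' or 'If'.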

### 1.2. Implement TF-IDF
Now you need to implement TF-IDF: create the vocabulary, compute the term frequencies, and weight them by inverse document frequency (IDF).

In [28]:
import itertools

# Flatten all the documents
flat_list = [word for doc in documents for word in doc]

# TODO: remove stop words from the vocabulary
stop_words = set(stopwords.words('english')) 

words = [w for w in flat_list if w not in stop_words]

# TODO: we take the 500 most common words only
counts = Counter(words)

# most_common(500) returns (word, count) pairs in descending order of count;
# keep only the words and sort them alphabetically
vocabulary = sorted(word for word, _ in counts.most_common(500))

print(vocabulary)

['0', '000', '1', '10', '100', '11', '1st', '2', '3', '30', '4', '40', '5', '6', '60', '7', '8', '9', 'A', 'AI', 'Am', 'ET', 'I', 'If', 'In', 'It', 'LA', 'My', 'NY', 'No', 'Of', 'S', 'T', 'To', 'US', 'We', 'X', 'abort', 'achiev', 'action', 'actual', 'ad', 'advanc', 'agre', 'aim', 'air', 'alien', 'allow', 'almost', 'alreadi', 'also', 'alway', 'amaz', 'amp', 'ani', 'announc', 'anoth', 'anyth', 'appreci', 'around', 'articl', 'ask', 'attempt', 'auto', 'autopilot', 'away', 'awesom', 'babi', 'back', 'bad', 'badastronom', 'base', 'batteri', 'beauti', 'befor', 'believ', 'best', 'better', 'big', 'bit', 'booster', 'break', 'bring', 'btw', 'build', 'burn', 'busi', 'california', 'call', 'camera', 'canaver', 'cape', 'car', 'carbon', 'care', 'case', 'caus', 'center', 'chang', 'charg', 'check', 'china', 'climat', 'close', 'coast', 'come', 'comment', 'compani', 'competit', 'complet', 'complex', 'confirm', 'consum', 'control', 'cool', 'cost', 'could', 'countri', 'cours', 'cover', 'crazi', 'creat', 'cri

## TF (Term Frequency)

The number of times a term appears in a document


In [29]:
def tf(vocabulary, documents):
    matrix = [0] * len(documents)
    for i, document in enumerate(documents):
        counts = Counter(document)
        matrix[i] = [0] * len(vocabulary)
        for j, term in enumerate(vocabulary):
            matrix[i][j] = counts[term]
    return matrix

# Store the result under a new name so the tf() function is not overwritten
tf_matrix = tf(vocabulary, documents)
nonzero = np.where(np.array(tf_matrix[1]) > 0)
np.array(vocabulary)[nonzero], np.array(tf_matrix[1])[nonzero]

(array(['base', 'exactli', 'tesla', 'x80', 'xa6', 'xe2'], dtype='<U15'),
 array([1, 1, 1, 1, 1, 1]))

## IDF

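$\mathrm{idf}(t) = \log\left(\frac{N}{\mathrm{df}(t)}\right)$, where $N$ is the number of documents and $\mathrm{df}(t)$ is the number of documents that contain the term $t$.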
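To make the two definitions concrete, here is a small worked example on a made-up corpus of three already-normalized documents. The names `toy_docs` and `toy_idf` are hypothetical and exist only for this illustration; the natural logarithm is assumed, matching `math.log`.

```python
import math
from collections import Counter

# Toy corpus of already-normalized documents (made up for illustration)
toy_docs = [
    ["tesla", "model", "car"],
    ["tesla", "rocket"],
    ["rocket", "launch", "rocket"],
]

def toy_idf(term, docs):
    """log(N / df), where df is the number of documents containing the term."""
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)

tf_rocket = Counter(toy_docs[2])["rocket"]  # term frequency in the third document
idf_rocket = toy_idf("rocket", toy_docs)    # log(3 / 2): appears in two documents
idf_car = toy_idf("car", toy_docs)          # log(3 / 1): appears in one document

print(tf_rocket, round(idf_rocket, 3), round(idf_car, 3))
```

The rarer term ('car') receives the larger IDF, which is the effect the weighting is meant to have.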

In [30]:
def idf(vocabulary, documents):
    """TODO: compute IDF, storing values in a dictionary"""
    idf = {}
    
    N = len(documents)
    
    
    for vocab_w in vocabulary:
        times_appeared = 0
        
        # Find how many documents have that word
        for doc in documents:
            for doc_w in doc:
                if vocab_w == doc_w:
                    times_appeared += 1
                    break
            
        
        # Apply our calculation
        idf[vocab_w] = math.log(N / times_appeared)
    
    return idf

idf = idf(vocabulary, documents)
[idf[key] for key in vocabulary[:5]]


# Print some of the IDF scores
for i, k in enumerate(idf):
    print("(", k ,"|", idf[k],")")
    if i > 30:
        break


( 0 | 4.417776966497951 )
( 000 | 5.305080161498854 )
( 1 | 3.478229372459529 )
( 10 | 4.5429401094519575 )
( 100 | 5.053765733217948 )
( 11 | 5.459230841326113 )
( 1st | 4.808643275184963 )
( 2 | 3.754482749087687 )
( 3 | 3.513320692270799 )
( 30 | 4.853095037755796 )
( 4 | 4.447629929647633 )
( 40 | 5.305080161498854 )
( 5 | 3.8836944805676934 )
( 6 | 4.447629929647633 )
( 60 | 4.853095037755796 )
( 7 | 5.053765733217948 )
( 8 | 4.611932980938908 )
( 9 | 3.537418243849859 )
( A | 4.1829373754205506 )
( AI | 4.611932980938908 )
( Am | 4.417776966497951 )
( ET | 4.766083660766167 )
( I | 2.4974001194478026 )
( If | 4.333219578469889 )
( In | 4.766083660766167 )
( It | 3.4897901948606047 )
( LA | 4.611932980938908 )
( My | 4.280575844984466 )
( NY | 5.546242218315742 )
( No | 4.115496094625017 )
( Of | 5.379188133652576 )
( S | 2.838192017213532 )


In [31]:
def vectorize(document, vocabulary, idf):
    vector = [0]*len(vocabulary)
    counts = Counter(document)
    for i,term in enumerate(vocabulary):
        vector[i] = idf[term] * counts[term]
    return vector

document_vectors = [vectorize(s, vocabulary, idf) for s in documents]
np.array(vocabulary)[np.where(np.array(document_vectors[1]) > 0)], np.array(document_vectors[1])[np.where(np.array(document_vectors[1]) > 0)]

(array(['base', 'exactli', 'tesla', 'x80', 'xa6', 'xe2'], dtype='<U15'),
 array([5.54624222, 4.47840159, 2.09481271, 2.67627933, 3.24365713,
        2.62112751]))

### 1.3. Compare the results with the reference implementation of scikit-learn library.

Now we use the scikit-learn library. As you can see, the way we normalize the text affects the result. Feel free to improve it further (OPTIONAL); see, e.g., https://stackoverflow.com/questions/36182502/add-stemming-support-to-countvectorizer-sklearn
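Normalization is not the only source of differences. With its default settings, `TfidfVectorizer` applies a smoothed IDF, $\ln\left(\frac{1+N}{1+\mathrm{df}(t)}\right) + 1$, and then scales every document vector to unit Euclidean length, so its values are not expected to match a plain $\log(N/\mathrm{df}(t))$ weighting. The sketch below reproduces these two steps on made-up numbers with the standard library only; `smoothed_idf` and `l2_normalize` are hypothetical helper names, not scikit-learn functions.

```python
import math

def smoothed_idf(n_docs, df):
    """Smoothed IDF in the form documented for scikit-learn's defaults (smooth_idf=True)."""
    return math.log((1 + n_docs) / (1 + df)) + 1

def l2_normalize(vector):
    """Scale a vector to unit Euclidean length; all-zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)

# Toy numbers (made up): 3 documents, a term that appears in 2 of them
plain = math.log(3 / 2)
smooth = smoothed_idf(3, 2)
row = l2_normalize([2 * smooth, 0.0, 1.0])

print(round(plain, 3), round(smooth, 3), [round(x, 3) for x in row])
```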

In [32]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

tfidf = TfidfVectorizer(analyzer='word', ngram_range=(1,1), min_df = 1, stop_words = 'english', max_features=500)

features = tfidf.fit(original_documents)
corpus_tf_idf = tfidf.transform(original_documents) 

sum_words = corpus_tf_idf.sum(axis=0)
words_freq = [(word, sum_words[0, idx]) for word, idx in tfidf.vocabulary_.items()]
print(sorted(words_freq, key = lambda x: x[1], reverse=True)[:5])
print('tesla', corpus_tf_idf[1, features.vocabulary_['tesla']])

[('tesla', 101.90986835463575), ('model', 79.77144565643432), ('spacex', 75.61689890838036), ('yes', 65.62323278091372), ('teslamotors', 64.90757497311142)]
tesla 0.3393247403223787


### 1.4.  Apply TF-IDF for information retrieval
We can use the vector representation of documents to implement an information retrieval system. We test with the query $Q$ = "tesla nasa"
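The ranking idea can be sketched on toy data before applying it to the tweets. The vectors below are made up, the three-word vocabulary is hypothetical, and `cosine` is an illustrative helper written with the standard library. Note the guard for all-zero vectors: a document with no vocabulary terms has zero norm, and without the guard the division would be undefined.

```python
import math

def cosine(v1, v2):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm if norm > 0 else 0.0

# Toy TF-IDF vectors over a hypothetical vocabulary ['nasa', 'rocket', 'tesla']
toy_vectors = [
    [0.0, 1.1, 0.0],   # document 0: only 'rocket'
    [1.5, 0.0, 0.4],   # document 1: 'nasa' and 'tesla'
    [0.0, 0.0, 0.0],   # document 2: no vocabulary terms at all
]
toy_query = [1.5, 0.0, 0.4]  # vector for the query "tesla nasa"

# Sort document indices by similarity to the query, best match first
ranking = sorted(range(len(toy_vectors)),
                 key=lambda i: cosine(toy_query, toy_vectors[i]),
                 reverse=True)
print(ranking)
```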

In [95]:
lst = [0.0, 0.01, 0.02, 0.0, 0.1, 0.0003]
sorted(lst)

[0.0, 0.0, 0.0003, 0.01, 0.02, 0.1]

In [107]:
def cosine_similarity(v1, v2):
    """Compute the cosine similarity between two vectors."""
    # An all-zero vector has no direction; return 0.0 instead of NaN
    if not np.any(v1) or not np.any(v2):
        return 0.0

    return 1 - spatial.distance.cosine(v1, v2)

def search_vec(query, k, vocabulary, stemmer, document_vectors, original_documents):
    q = query.split()
    q = [stemmer.stem(w) for w in q]

    query_vector = vectorize(q, vocabulary, idf)

    # Rank the documents by cosine similarity to the query.
    # The loop variable is named doc_id so that it does not overwrite k.
    scores = [(cosine_similarity(query_vector, document_vector), doc_id)
              for doc_id, document_vector in enumerate(document_vectors)]
    scores_sorted = sorted(scores, key=lambda item: item[0], reverse=True)

    # Print only the k best matches, taken from the sorted list
    print('Top-{0} documents'.format(k))
    for rank, (sim, doc_id) in enumerate(scores_sorted[:k]):
        print(rank, original_documents[doc_id])

query = "tesla nasa"
stemmer = PorterStemmer()
search_vec(query, 5, vocabulary, stemmer, document_vectors, original_documents)

We can also use the scikit-learn library to do the retrieval.
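A note on why a plain dot product is sufficient in this case: `TfidfVectorizer` L2-normalizes its rows by default, and for unit-length vectors the dot product equals the cosine similarity. The check below illustrates this on two arbitrary toy vectors with the standard library; `unit` and `dot` are hypothetical helper names.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    """Return v scaled to unit Euclidean length (assumes v is not all zeros)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

a, b = [3.0, 4.0, 0.0], [1.0, 2.0, 2.0]  # arbitrary toy vectors

cos_ab = dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))
dot_units = dot(unit(a), unit(b))

print(round(cos_ab, 6), round(dot_units, 6))
```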

In [None]:
new_features = tfidf.transform([query])

cosine_similarities = linear_kernel(new_features, corpus_tf_idf).flatten()
related_docs_indices = cosine_similarities.argsort()[::-1]

topk = 5
print('Top-{0} documents'.format(topk))
for i in range(topk):
    print(i, original_documents[related_docs_indices[i]])