## Learning Objectives

- How to extract keywords from a corpus (a collection of texts) using TF-IDF

- Explain what TF-IDF is

- Applications of keyword extraction and Word2Vec

## Review: What preprocessing is needed before applying a machine learning algorithm to text data?

1. The text must be split into words; this step is called tokenization

2. The words must then be encoded as integers or floating-point values

3. The scikit-learn library offers easy-to-use tools for both tokenization and feature extraction from text data
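
To make steps 1 and 2 concrete, here is a minimal hand-rolled sketch in plain Python. The two sample sentences, the `tokenize` helper and the regular expression are illustrative assumptions; scikit-learn's `CountVectorizer` performs roughly the same two steps in one call, with its own tokenization rules.

```python
import re

# Two short sample documents (illustrative only)
docs = ["The sky is blue", "The sun is bright"]

def tokenize(text):
    """Step 1: lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

tokens = [tokenize(d) for d in docs]

# Step 2: map every distinct word to an integer column index
vocab = {word: idx for idx, word in enumerate(sorted({w for t in tokens for w in t}))}

# Encode each document as a vector of word counts, one column per vocabulary word
counts = [[t.count(word) for word in vocab] for t in tokens]

print(vocab)
print(counts)
```

Each row of `counts` is one document and each column is one word, which is the same layout as the TF-IDF matrix used later in these notes.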

## What is TF-IDF Vectorizer?

- Word counts are a good starting point, but they are very basic

An alternative is to weight word frequencies, and one of the most widely used methods is TF-IDF.

**Term Frequency**: how often a given word appears within a document

**Inverse Document Frequency**: a factor that downscales words appearing in many documents

## Intuitive idea behind TF-IDF:
    
- If a word appears frequently in a document, it's important. Give the word a high score

- But if a word appears in many documents, it's not a unique identifier. Give the word a low score
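
As a rough sketch of how these two ideas combine into one score, the block below computes TF-IDF by hand in plain Python. It assumes the smoothed formula that scikit-learn's `TfidfVectorizer` uses by default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing t, followed by L2 normalization of each document's row. The function name `tfidf` is illustrative, and the tokens are the activity corpus with English stop words removed by hand (keeping its spelling 'bule').

```python
import math
from collections import Counter

# Activity corpus, lowercased, with English stop words removed by hand (assumption)
docs = [
    ["sky", "bule"],
    ["sun", "bright"],
    ["sun", "sky", "bright"],
    ["shining", "sun", "bright", "sun"],
]

def tfidf(docs):
    """Return (vocabulary, rows) where rows[i][j] is the TF-IDF of word j in document i."""
    n_docs = len(docs)
    vocab = sorted({w for d in docs for w in d})
    # Document frequency: in how many documents does each word occur?
    df = Counter(w for d in docs for w in set(d))
    # Smoothed inverse document frequency: rare words get a larger weight
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}
    rows = []
    for d in docs:
        tf = Counter(d)  # term frequency: raw count within this document
        raw = [tf[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(x * x for x in raw)) or 1.0  # avoid dividing by zero
        rows.append([x / norm for x in raw])
    return vocab, rows

vocab, rows = tfidf(docs)
print(vocab)
for row in rows:
    print([round(x, 4) for x in row])
```

'bule' occurs in only one document, so it receives a larger idf than 'sky', which occurs in two. The first row therefore scores 'bule' at about 0.785 and 'sky' at about 0.619, the same values that appear in the first row of the matrix printed in the activity.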

<img src="Images/tfidf_slide.png" width="700" height="700">

## Activity: Obtain the keywords from TF-IDF

1. Obtain the TF-IDF matrix for the given corpus

2. Sum each column of the matrix (one column per word) to get a total score per word

3. Sort the scores from highest to lowest

4. Return the words associated with the top scores from step 3

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

def keyword_sklearn(docs, k):
    """Return the k words with the highest summed TF-IDF score across docs."""
    vectorizer = TfidfVectorizer(stop_words='english')
    tfidf_matrix = vectorizer.fit_transform(docs)
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    feature_names = vectorizer.get_feature_names_out().tolist()
    print(tfidf_matrix.toarray())
    print(feature_names)
    # Column-wise sum: one total score per word
    tfidf_scores = np.asarray(tfidf_matrix.sum(axis=0)).ravel().tolist()
    ranked = sorted(zip(feature_names, tfidf_scores), key=lambda x: x[1], reverse=True)
    return ranked[:k]

# 'bule' (sic) is left unchanged so that the recorded output below still matches
documents = ['The sky is bule', 'The sun is bright', 'The sun in the sky is bright', 'we can see the shining sun, the bright sun']

print(keyword_sklearn(documents, 3))

[[0.         0.78528828 0.         0.6191303  0.        ]
 [0.70710678 0.         0.         0.         0.70710678]
 [0.53256952 0.         0.         0.65782931 0.53256952]
 [0.36626037 0.         0.57381765 0.         0.73252075]]
['bright', 'bule', 'shining', 'sky', 'sun']
[('sun', 1.9721970507561841), ('bright', 1.605936677684143), ('sky', 1.27695960978985)]


## Word2Vec

- Pretrained models assign a vector to each English word

- The process of assigning a vector to each word is called word embedding; Word2Vec is one well-known method for it, and GloVe is another

- In DS 2.4, we will learn how these vectors are trained

- Download the pretrained GloVe word vectors (a very large file): https://nlp.stanford.edu/projects/glove/

- Do not open the extracted file directly; read it line by line in code instead

## What property do the vectors associated with each word have in Word2Vec?

- Words with similar meanings are closer to each other in Euclidean space

- For example, if $V_{pizza}$, $V_{food}$ and $V_{sport}$ represent the vectors associated with pizza, food and sport, then:

$\| V_{pizza} - V_{food} \| < \| V_{pizza} - V_{sport} \|$
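
The norm in this inequality is the Euclidean distance, the square root of the sum of squared component differences. As a small sketch of the comparison, the block below uses three hypothetical 3-dimensional vectors invented for illustration; real embedding vectors have hundreds of dimensions and different values.

```python
import math

# Hypothetical toy vectors, invented for illustration only
vectors = {
    "pizza": [0.9, 0.8, 0.1],
    "food": [0.8, 0.9, 0.2],
    "sport": [0.1, 0.2, 0.9],
}

def euclidean_distance(u, v):
    """Square root of the sum of squared component-wise differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

d_food = euclidean_distance(vectors["pizza"], vectors["food"])
d_sport = euclidean_distance(vectors["pizza"], vectors["sport"])

print(d_food, d_sport)
print(d_food < d_sport)  # pizza is closer to food than to sport in this toy example
```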

## Activity: Obtain the vector associated with pizza in GloVe

In [5]:
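# The built-in open() handles text encoding, so codecs is not needed
with open('/Users/miladtoutounchian/Downloads/glove.840B.300d.txt', 'r', encoding='utf-8') as f:
    # Each line holds a word followed by its vector components, separated by spaces
    for line in f:
        parts = line.split()
        if parts[0] == 'pizza':
            vector = [float(x) for x in parts[1:]]
            print(parts[0])
            print(vector)
            print(len(vector))
            break  # stop reading as soon as the word is found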

pizza
[0.0068727, -0.21634, 0.27831, -0.26192, 0.22884, 0.89332, 0.4131, 0.27377, 0.22652, 1.5041, -0.58059, 0.56083, -0.18432, 0.27738, -0.10709, -0.13519, 0.023817, 1.1765, -0.12659, 0.043173, 0.23242, -0.63213, 0.40228, -0.20605, 0.46381, -0.12991, -0.68031, -0.010371, 0.50033, -0.32266, 0.24053, 0.40178, 0.12051, -0.13791, 0.40821, 0.54735, -0.25946, 0.020254, 0.21249, 0.91965, -0.21202, 0.66568, 0.25879, -0.36124, -0.10977, 0.87492, -0.089425, 0.39184, -0.32589, -0.22331, -0.17504, 0.074762, 0.45271, 0.085476, -0.079526, -0.23986, -0.010322, 0.089974, 0.29794, 0.26672, -0.044288, -0.082716, 0.20801, 0.38404, 0.15281, -1.1292, -0.094527, 0.16901, -0.018155, 0.31023, -0.095716, 0.32587, -0.2225, -0.040376, -0.52201, -0.040547, -0.2473, 0.059596, 0.31592, 0.48751, 0.14681, -0.29337, 0.61309, -0.7844, -0.16297, 0.042847, 0.90914, 0.70536, -0.44725, -0.3035, -0.26998, -0.32488, 0.10539, -0.24494, -0.023413, 0.51872, -0.0060798, -0.039611, 0.28618, 0.17071, -0.661, -0.1303, 0.59381, 0.3

## Activity: Obtain the vectors associated with pizza, food and sport in GloVe

In [6]:
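# Words whose vectors we want to collect
targets = {'pizza', 'food', 'sport'}

with open('/Users/miladtoutounchian/Downloads/glove.840B.300d.txt', 'r', encoding='utf-8') as f:
    ls = {}  # maps each target word to its vector (a list of floats)
    for line in f:
        parts = line.split()
        if parts[0] in targets:
            ls[parts[0]] = [float(x) for x in parts[1:]]
        # Stop reading the file once all target words have been found
        if len(ls) == len(targets):
            break

print(ls)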

{'food': [-0.43512, 0.028351, 0.4911, -0.35168, -0.11578, 1.0369, -0.09755, 0.086624, -0.1789, 2.4555, -1.2798, 0.021074, -0.03225, 0.094673, -0.14, -0.52143, 0.00066447, 1.8051, -0.22604, 0.33227, 0.00041163, 0.062654, 0.14973, -0.5026, 0.089701, -0.26908, -0.083594, -0.16677, -0.17036, -0.32049, -0.23586, -0.40395, 0.32683, -0.21712, 0.098576, 0.47552, 0.092994, -0.061034, 0.12673, 0.60856, -0.0067936, -0.21831, 0.021751, -0.24858, -0.035244, 0.13692, -0.37109, 0.54421, 0.040017, 0.13992, 0.039967, -0.31745, 0.24408, -0.2355, 0.24884, -0.31929, 0.11282, -0.010198, -0.050538, -0.1155, 0.30273, -0.61441, 0.016135, 0.010675, 0.15108, -1.1759, 0.097104, 0.071706, 0.19795, 0.27253, -0.22122, 0.64478, -0.066252, -0.29403, 0.16281, -0.0078554, -0.14986, -0.11364, 0.36459, 0.13723, 0.46612, 0.26157, 0.0065022, -0.67068, -0.075247, -0.50802, -0.049202, 0.90222, -0.30085, 0.15453, -0.44762, -0.30997, -0.14006, -0.48079, 0.07838, -0.20951, -0.07558, -0.37064, 0.48714, -0.31549, -0.51954, -0.239

## Activity: Show that the vector of pizza is closer to the vector of food than to the vector of sport

In [14]:
import numpy as np

np.linalg.norm(np.array(ls['pizza']) - np.array(ls['food']))

6.312737677708336

In [15]:
np.linalg.norm(np.array(ls['pizza']) - np.array(ls['sport']))

8.817056623492523

In [16]:
np.linalg.norm(np.array(ls['food']) - np.array(ls['sport']))

8.303718155175721
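
The loading and distance cells above can be folded into one reusable helper. This is only a sketch: `load_vectors` is a hypothetical name, and the `sample_lines` list is a made-up miniature stand-in for the real embedding file, so the numbers it prints are not GloVe values.

```python
import math

def load_vectors(lines, words):
    """Scan 'word v1 v2 ...' lines and return {word: vector} for the requested words."""
    wanted = set(words)
    found = {}
    for line in lines:
        token, _, rest = line.rstrip().partition(" ")
        if token in wanted:
            found[token] = [float(x) for x in rest.split()]
            if len(found) == len(wanted):
                break  # stop early once every requested word has been seen
    return found

# Made-up stand-in for the embedding file (illustrative values only)
sample_lines = [
    "pizza 0.9 0.8 0.1\n",
    "tree 0.0 0.5 0.5\n",
    "food 0.8 0.9 0.2\n",
    "sport 0.1 0.2 0.9\n",
]

vecs = load_vectors(sample_lines, ["pizza", "food", "sport"])
for other in ("food", "sport"):
    # math.dist (Python 3.8+) returns the Euclidean distance between two points
    print("pizza", other, math.dist(vecs["pizza"], vecs[other]))
```

Because the helper accepts any iterable of lines, an open file handle can be passed in place of `sample_lines`, so the large file is still read one line at a time.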