<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#From-Scratch" data-toc-modified-id="From-Scratch-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>From Scratch</a></span></li><li><span><a href="#sklearn" data-toc-modified-id="sklearn-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>sklearn</a></span></li><li><span><a href="#TF-IDF" data-toc-modified-id="TF-IDF-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>TF-IDF</a></span></li></ul></div>

Some sources I referenced while working:

* [How to extract keywords from text with TF-IDF and Python’s Scikit-Learn](https://www.freecodecamp.org/news/how-to-extract-keywords-from-text-with-tf-idf-and-pythons-scikit-learn-b2a0f3d7e667/)
* [TFIDF/TFIDF.ipynb at master · mayank408/TFIDF · GitHub](https://github.com/mayank408/TFIDF/blob/master/TFIDF.ipynb)
* [How to process textual data using TF-IDF in Python](https://www.freecodecamp.org/news/how-to-process-textual-data-using-tf-idf-in-python-cd2bbc0a94a3/)

In [1]:
import re
import itertools
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

In [29]:
# Read the whole transcript into one string:
with open('texts/legend.txt', 'r') as f:
    text = f.read()

# Split the string into a list of sentences:
sentences = text.split(".")

print(sentences)

["Nah I ain't scared to go over there", " I don't think it's none to be scared of", " But, but it's still strange to hear", " I think them old people say that spirit will not let you dig to get to the gold, whatever's down there", " That's what them old people used to say", " It's not going let ya", " And it's kinda strange they clean all that", ' They clean all the distance, but they never clean that spot there', '']


The last item in the list above is empty. To get rid of it:

    del sentences[-1]

I then rechecked the length of the list with `len(sentences)` and saw that it had dropped from `9` to `8`.

In [35]:
del sentences[-1]
len(sentences)

8
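
One caveat with `del sentences[-1]`: rerunning the cell deletes a real sentence, because the last item is no longer empty. A minimal sketch of a filter that can be rerun safely is below. The `raw` string is a stand-in for the file contents, and stripping the leading spaces is an extra step the original cell does not do.

```python
# Stand-in for the text read from the file (the first two sentences only).
raw = "Nah I ain't scared to go over there. I don't think it's none to be scared of."

# Split on periods, trim whitespace, and keep only non-empty pieces.
sentences = [piece.strip() for piece in raw.split(".") if piece.strip()]

print(len(sentences))
print(sentences)
```

Running this twice gives the same list, whereas a second `del sentences[-1]` would shorten it again.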

## From Scratch

In [9]:
bows = []
for sentence in sentences:
    words = re.sub("[^a-zA-Z']"," ", sentence).lower().split()
    bows.append(words)

In [15]:
for item in bows:
    print(str(len(item)) + ": " + str(item))

8: ['nah', 'i', "ain't", 'scared', 'to', 'go', 'over', 'there']
9: ['i', "don't", 'think', "it's", 'none', 'to', 'be', 'scared', 'of']
7: ['but', 'but', "it's", 'still', 'strange', 'to', 'hear']
21: ['i', 'think', 'them', 'old', 'people', 'say', 'that', 'spirit', 'will', 'not', 'let', 'you', 'dig', 'to', 'get', 'to', 'the', 'gold', "whatever's", 'down', 'there']
8: ["that's", 'what', 'them', 'old', 'people', 'used', 'to', 'say']
5: ["it's", 'not', 'going', 'let', 'ya']
8: ['and', "it's", 'kinda', 'strange', 'they', 'clean', 'all', 'that']
12: ['they', 'clean', 'all', 'the', 'distance', 'but', 'they', 'never', 'clean', 'that', 'spot', 'there']


In [30]:
counts = {}
for item in bows:
    for word in item:
        try:
            counts[word] += 1
        except KeyError:  # first time we see this word
            counts[word] = 1

print(counts)

{'nah': 1, 'i': 3, "ain't": 1, 'scared': 2, 'to': 6, 'go': 1, 'over': 1, 'there': 3, "don't": 1, 'think': 2, "it's": 4, 'none': 1, 'be': 1, 'of': 1, 'but': 3, 'still': 1, 'strange': 2, 'hear': 1, 'them': 2, 'old': 2, 'people': 2, 'say': 2, 'that': 3, 'spirit': 1, 'will': 1, 'not': 2, 'let': 2, 'you': 1, 'dig': 1, 'get': 1, 'the': 2, 'gold': 1, "whatever's": 1, 'down': 1, "that's": 1, 'what': 1, 'used': 1, 'going': 1, 'ya': 1, 'and': 1, 'kinda': 1, 'they': 3, 'clean': 3, 'all': 2, 'distance': 1, 'never': 1, 'spot': 1}
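
The same tally can be produced with `collections.Counter` from the standard library. This is only a sketch of an alternative, not what the notebook ran; `sample_bows` is a hypothetical name holding the first three tokenized sentences from above.

```python
from collections import Counter
from itertools import chain

# Hypothetical sample: the first three tokenized sentences from `bows`.
sample_bows = [
    ['nah', 'i', "ain't", 'scared', 'to', 'go', 'over', 'there'],
    ['i', "don't", 'think', "it's", 'none', 'to', 'be', 'scared', 'of'],
    ['but', 'but', "it's", 'still', 'strange', 'to', 'hear'],
]

# Flatten the list of lists and count every word in one pass.
counts = Counter(chain.from_iterable(sample_bows))

# The most frequent words first.
print(counts.most_common(5))
```

A `Counter` is a `dict` subclass, so it can be passed anywhere the plain `counts` dictionary is used.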


In [7]:
def termfreq(wordDict, bow):
    """Return each word's count divided by the number of tokens in `bow`."""
    tfDict = {}
    bowCount = len(bow)
    for word, count in wordDict.items():
        tfDict[word] = count / bowCount
    return tfDict
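
`termfreq` is defined above but never called. A rough sketch of how it could be carried through to TF-IDF scores follows. It assumes the textbook inverse document frequency, `log(N / df)`; `inverse_doc_freq` and `sample_bows` are hypothetical names, not part of the original notebook. scikit-learn's `TfidfVectorizer` defaults to a smoothed IDF and L2-normalizes each row, so these numbers will not match its output.

```python
import math
from collections import Counter

# Hypothetical sample: the first three tokenized sentences from `bows`.
sample_bows = [
    ['nah', 'i', "ain't", 'scared', 'to', 'go', 'over', 'there'],
    ['i', "don't", 'think', "it's", 'none', 'to', 'be', 'scared', 'of'],
    ['but', 'but', "it's", 'still', 'strange', 'to', 'hear'],
]

def termfreq(wordDict, bow):
    """Return each word's count divided by the number of tokens in `bow`."""
    bowCount = len(bow)
    return {word: count / bowCount for word, count in wordDict.items()}

def inverse_doc_freq(bows):
    """Textbook IDF (an assumption): log(number of docs / docs containing the word)."""
    n_docs = len(bows)
    doc_freq = Counter()
    for bow in bows:
        doc_freq.update(set(bow))  # count each word once per sentence
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}

# Term frequencies per sentence, then one IDF table for the whole sample.
tfs = [termfreq(Counter(bow), bow) for bow in sample_bows]
idf = inverse_doc_freq(sample_bows)

# TF-IDF is the product of the two for every word in every sentence.
tfidf = [{word: tf * idf[word] for word, tf in tf_doc.items()} for tf_doc in tfs]

for scores in tfidf:
    print(sorted(scores.items(), key=lambda pair: pair[1], reverse=True)[:3])
```

With this unsmoothed formula, a word that appears in every sentence (here `to`) scores zero, which is the intended effect of the IDF term.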

## sklearn

In [31]:
from sklearn.feature_extraction.text import CountVectorizer

In [36]:
vectorizer = CountVectorizer()

# learn the vocabulary from the sentences
vecs = vectorizer.fit(sentences)

# turn each sentence into a row of token counts
bow = vecs.transform(sentences)

# (number of sentences, number of vocabulary terms)
bow.shape

(8, 45)

In [38]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
vecs.get_feature_names_out()[0:10].tolist()

['ain', 'all', 'and', 'be', 'but', 'clean', 'dig', 'distance', 'don', 'down']
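
The feature names show `ain` and `don` rather than `ain't` and `don't`, and there is no `i`. That follows from `CountVectorizer`'s default `token_pattern`, which keeps only runs of two or more word characters after lowercasing. The sketch below applies that same pattern with `re` to the first sentence; it imitates the tokenization step only and is not a call into scikit-learn.

```python
import re

# Default `token_pattern` of CountVectorizer: two or more word characters.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

sentence = "Nah I ain't scared to go over there"

# CountVectorizer lowercases by default before tokenizing.
tokens = re.findall(DEFAULT_TOKEN_PATTERN, sentence.lower())

print(tokens)
```

The single-letter `i` and the `t` after the apostrophe are dropped. This would also account for the 45 features here against 47 distinct words in the from-scratch counts: `i` disappears and `that's` collapses into `that`.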

In [44]:
tf_feature_names = vectorizer.get_feature_names_out()

In [39]:
tf_array = bow.toarray()

In [46]:
wordframe = pd.DataFrame(data=tf_array, columns=tf_feature_names)
wordframe.head(8)

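|   | ain | all | and | be | but | clean | dig | distance | don | down | ... | there | they | think | to | used | what | whatever | will | ya | you |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | ... | 1 | 0 | 1 | 2 | 0 | 0 | 1 | 1 | 0 | 1 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 6 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | 0 | 1 | 0 | 0 | 1 | 2 | 0 | 1 | 0 | 0 | ... | 1 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |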


In [48]:
# wordframe.to_csv('outputs/words.csv', sep=',')

## TF-IDF

In [50]:
vectorizer = TfidfVectorizer()

In [51]:
response = vectorizer.fit_transform(sentences)

In [52]:
print(response)

  (0, 18)	0.42484368112477977
  (0, 0)	0.42484368112477977
  (0, 27)	0.35605216161420306
  (0, 38)	0.2384522822616481
  (0, 11)	0.42484368112477977
  (0, 24)	0.42484368112477977
  (0, 35)	0.3072438017722249
  (1, 27)	0.33872226489301194
  (1, 38)	0.22684624845527496
  (1, 8)	0.40416553923915166
  (1, 37)	0.33872226489301194
  (1, 15)	0.25627348996602684
  (1, 20)	0.40416553923915166
  (1, 3)	0.40416553923915166
  (1, 22)	0.40416553923915166
  (2, 38)	0.2248871593770373
  (2, 15)	0.254060261408565
  (2, 4)	0.671593984809181
  (2, 30)	0.40067508568695387
  (2, 31)	0.3357969924045905
  (2, 14)	0.40067508568695387
  (3, 38)	0.2823687076105529
  (3, 35)	0.18191487706664233
  (3, 37)	0.21081364322323456
  (3, 34)	0.21081364322323456
  :	:
  (4, 40)	0.4253747324300509
  (4, 39)	0.4253747324300509
  (5, 15)	0.3249854609640874
  (5, 21)	0.4295403766095668
  (5, 17)	0.4295403766095668
  (5, 12)	0.5125302819766763
  (5, 43)	0.5125302819766763
  (6, 15)	0.2676227777546686
  (6, 31)	0.3537228662630