# Intro to scikit-learn (sklearn)

_This notebook was inspired by notebooks written by Lauren F. Klein and Melanie Walsh_

Much of what we'll do for the rest of the semester entails turning words into numbers: tf-idf, topic modeling, BERT, similarity, classification, clustering. Python's machine learning library, scikit-learn, will be crucial to many of these methods. Today we'll just introduce ourselves to the library, setting ourselves up for what's to come.

## Install scikit-learn

We begin by installing scikit-learn. The package is named `scikit-learn` on PyPI, but in Python we import it as `sklearn`.

In [None]:
!pip install scikit-learn

## Import CountVectorizer

Now we import `CountVectorizer`, which [converts a collection of text documents to a matrix of token counts](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html). Turning words into numbers in this way allows us to perform analyses according to a philosophy of language called _distributional semantics_, the idea that a word's meaning can be inferred from the contexts in which it appears. This idea underlies much of data science with text. Learn more by referring to [today's reading](https://web.stanford.edu/~jurafsky/slp3/6.pdf).

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# instantiate CountVectorizer()
cv = CountVectorizer()

## Read in lyrics corpus

Below we set the path to the directory that contains all the lyrics text files we want to analyze, then read each file into a list.

In [None]:
import os

base_dir = "./lyrics/"

all_docs = []

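# sort the filenames so the order is the same on every run
docs = sorted(os.listdir(base_dir))

for doc in docs:
    # assuming the lyrics files are UTF-8 encoded
    with open(os.path.join(base_dir, doc), "r", encoding="utf-8") as file:
        text = file.read()
        all_docs.append(text)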

# just take a look at the first item to be sure
print(docs[0]) 
print("\n")
print(all_docs[0])

In [None]:
# this step generates the document-term matrix for the docs
dtm = cv.fit_transform(all_docs)

# check shape
dtm.shape

You should get two numbers `(rows, columns)`. Each `row` is a doc, in this case a song. Each `column` records an element of the total vocabulary for the corpus. Does the result make sense?
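One way to check whether the result makes sense is to compare the shape against things we can count independently. The sketch below does this on a small made-up corpus (`sample_docs`, `sample_cv`, and `sample_dtm` are hypothetical names used only for illustration): the number of rows should match the number of documents, and the number of columns should match the size of the fitted vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three made-up "songs" (illustrative only)
sample_docs = ["la la la", "oh baby baby", "la baby yeah"]

sample_cv = CountVectorizer()
sample_dtm = sample_cv.fit_transform(sample_docs)

n_rows, n_cols = sample_dtm.shape

# rows: one per document
print(n_rows, len(sample_docs))

# columns: one per distinct word in the fitted vocabulary
print(n_cols, len(sample_cv.vocabulary_))
```

The same two comparisons can be run on the lyrics corpus by swapping in `dtm`, `all_docs`, and `cv`.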

In [None]:
# we can look at the vocabulary and counts like this

# sum_words is a vector that contains the total number of
# occurrences of each word across all texts in the corpus.
# In other words, we are adding up the elements in each
# column of the document-term matrix

# get_feature_names() was removed in scikit-learn 1.2;
# get_feature_names_out() is its replacement
vocab = cv.get_feature_names_out()
sum_words = dtm.toarray().sum(axis=0)

for x in range(15):
    print(str(vocab[x]) + ": " + str(sum_words[x]) + "\n")


In [None]:
# and we can sort it like this:

# first we create a dictionary with vocab as keys and counts as values

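dictVocab = {}
# compute the vocabulary and column sums once, outside the loop
vocab = cv.get_feature_names_out()
sum_words = dtm.toarray().sum(axis=0)
for x in range(dtm.shape[1]):
    dictVocab[vocab[x]] = sum_words[x]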

# then we sort the dictionary in order of counts

sortVocab = sorted(dictVocab.items(), key=lambda x: x[1], reverse=True)

# then we print top 30

for i in sortVocab[0:30]:
    print(i[0], i[1])

In [None]:
# we could also print from a bit lower in the counts

for i in sortVocab[200:230]:
    print(i[0], i[1])

That's it for today! scikit-learn and CountVectorizer set us up for the rest of the semester...