# Intro to scikit-learn (sklearn)

_This notebook was inspired by notebooks written by Lauren F. Klein and Melanie Walsh_

Much of what we'll do for the rest of the semester entails turning words into numbers: tf-idf, topic modeling, BERT, similarity, classification, clustering. Python's machine learning library, scikit-learn, will be crucial to many of these methods. Today we'll simply get acquainted with the library, setting ourselves up for what's to come.

## Install scikit-learn

We begin by installing scikit-learn. The package is published on PyPI as `scikit-learn`, but in Python it is imported as `sklearn`.

In [1]:
!pip install scikit-learn



## Import CountVectorizer

Now we import `CountVectorizer`, which [converts a collection of text documents to a matrix of token counts](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html). Turning words into numbers in this way allows us to perform analyses according to a theory of meaning called _distributional semantics_, which holds that words appearing in similar contexts tend to have similar meanings and which underlies much of data science with text. Learn more by referring to [today's reading](https://web.stanford.edu/~jurafsky/slp3/6.pdf).

In [2]:
from sklearn.feature_extraction.text import CountVectorizer

#instantiate CountVectorizer()
cv=CountVectorizer()
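
To make "a matrix of token counts" concrete, here is a minimal pure-Python sketch of the idea. It only approximates `CountVectorizer`'s default behavior (lowercasing, tokens of two or more word characters), and the names `toy_docs`, `tokenize`, `vocab`, and `matrix` are made up for illustration.

```python
import re
from collections import Counter

# Two short lines borrowed from the lyrics corpus (illustrative only)
toy_docs = [
    "teenage wasteland, it's only teenage wasteland",
    "don't cry, don't raise your eye",
]

# Assumption: this regex mimics CountVectorizer's default token pattern
token_pattern = re.compile(r"\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and return tokens of 2+ word characters."""
    return token_pattern.findall(text.lower())

# Count tokens per document
counts_per_doc = [Counter(tokenize(doc)) for doc in toy_docs]

# The vocabulary is every distinct token in the corpus, sorted alphabetically
vocab = sorted(set().union(*counts_per_doc))

# One row per document, one column per vocabulary term
matrix = [[counts[term] for term in vocab] for counts in counts_per_doc]

print(vocab)
for row in matrix:
    print(row)
```

Each row is a document and each column is a word, which is the layout `CountVectorizer` produces for the full corpus below.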

## Read in lyrics corpus

Below we set the path to the directory that contains all the lyrics text files we want to analyze, then read each file into a list.

In [3]:
import os

base_dir = "./lyrics/"

all_docs = []

docs = os.listdir(base_dir)

for doc in docs:
    with open(os.path.join(base_dir, doc), "r") as file:
        text = file.read()
        all_docs.append(text)

# just take a look at the first item to be sure
print(docs[0]) 
print("\n")
print(all_docs[0])

The-who-baba-oriley.txt



Out here in the fields, I fight for my meals
I get my back into my living
I don't need to fight to prove I'm right
I don't need to be forgiven, yeah, yeah, yeah, yeah, ye-ah


Don't cry, don't raise your eye
It's only teenage wasteland

Sally, take my hand, we'll travel south 'cross land
Put out the fire and don't look past my shoulder
The exodus is here, the happy ones are near
Let's get together before we get much older

Teenage wasteland, it's only teenage wasteland
Teenage wasteland, oh, yeah
Teenage wasteland
They're all wasted!

[Instrumental]


In [4]:
# this step generates the document-term matrix for the docs
dtm=cv.fit_transform(all_docs)

# check shape
dtm.shape

(74, 2804)

You should get two numbers, `(rows, columns)`. Each `row` is a doc, in this case a song. Each `column` corresponds to one word in the total vocabulary of the corpus. Here that means 74 songs and 2,804 distinct words. Does the result make sense?
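
One way to sanity-check the `(rows, columns)` reading is to try it on a corpus small enough to count by hand. The sketch below uses three made-up documents (`toy_docs` is a hypothetical name, not part of the lyrics corpus) and compares the shape against the number of documents and the size of the fitted vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three tiny documents; "the" appears in two of them but is one vocabulary entry
toy_docs = [
    "out here in the fields",
    "teenage wasteland",
    "the happy ones are near",
]

toy_cv = CountVectorizer()
toy_dtm = toy_cv.fit_transform(toy_docs)

n_docs, n_terms = toy_dtm.shape

# Rows should match the number of documents
print(n_docs == len(toy_docs))
# Columns should match the number of distinct tokens the vectorizer learned
print(n_terms == len(toy_cv.vocabulary_))
```

The same two checks can be run against `dtm`, `all_docs`, and `cv` from the cell above.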

In [5]:
# we can look at the vocabulary and counts like this

# sum_words is a vector that contains the total number of
# occurrences of each word across all texts in the corpus.
# In other words, we are adding up the elements in each
# column of the document-term matrix

# note: get_feature_names() was removed in scikit-learn 1.2;
# get_feature_names_out() is its replacement
vocab = cv.get_feature_names_out()
sum_words = dtm.toarray().sum(axis=0)

for x in range(15):
    print(str(vocab[x]) + ": " + str(sum_words[x]) + "\n")


128: 1

1956: 1

1989: 1

22nd: 1

41: 1

441: 1

45: 1

57: 1

abilities: 1

able: 1

abono: 1

about: 42

above: 5

ac: 1

accept: 2



In [6]:
# and we can sort it like this:

# first we create a dictionary with vocab as keys and counts as values

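dictVocab = {}
# compute the vocabulary and column sums once, rather than on every iteration
# (get_feature_names_out() replaces get_feature_names(), removed in scikit-learn 1.2)
vocab = cv.get_feature_names_out()
sum_words = dtm.toarray().sum(axis=0)
for x in range(dtm.shape[1]):
    dictVocab[vocab[x]] = sum_words[x]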

# then we sort the dictionary in order of counts

sortVocab = sorted(dictVocab.items(), key=lambda x: x[1], reverse=True)

# then we print top 30

for i in sortVocab[0:30]:
    print(i[0], i[1])

the 1022
you 932
to 547
and 506
it 469
me 367
on 346
my 314
that 261
we 253
in 238
yeah 231
of 219
can 213
with 213
be 206
no 203
your 200
oh 199
got 193
just 182
baby 179
love 178
is 172
don 167
all 165
for 163
re 148
day 135
this 129
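
Entries such as `don` and `re` in the list above are not typos. With the default settings used in this notebook, `CountVectorizer` lowercases the text and keeps only runs of two or more word characters, so contractions are split at the apostrophe and the leftover single letters are dropped. The sketch below shows this on a sample line stitched together from the lyrics printed earlier; the line itself is only an illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer() returns the function the vectorizer uses to turn
# raw text into tokens; it can be called without fitting first
analyzer = CountVectorizer().build_analyzer()

tokens = analyzer("Don't cry, they're all wasted!")
print(tokens)
# → ['don', 'cry', 'they', 're', 'all', 'wasted']
```

"Don't" becomes `don` (the lone `t` is discarded) and "they're" becomes `they` and `re`, which is why those fragments rank so high in the counts.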


In [7]:
# we could also print from a bit lower in the counts

for i in sortVocab[200:230]:
    print(i[0], i[1])

check 20
everybody 20
generation 20
look 20
uh 20
alright 19
everywhere 19
hell 19
nothin 19
pain 19
queen 19
radio 19
shot 19
spark 19
truth 19
believer 18
break 18
eyes 18
gon 18
hair 18
keeps 18
last 18
mutha 18
something 18
soul 18
than 18
then 18
these 18
tonight 18
an 17


That's it for today! scikit-learn and CountVectorizer set us up for the rest of the semester...