# Lab 2 Text Data

Cameron Matson

## Purpose

This dataset is a collection of reviews from Pitchfork, a music review site.  It contains 18,393 reviews from 1999 to the present day.  They look at a variety of music.  There is a lot of research being done on the way that people interact with music, specifically how they describe it in words.  What I'd like to find out from this data is whether there is any relationship between the way the music is described in the reviews and the music itself.

## Data

In [158]:
import pandas as pd
import numpy as np

reviews = pd.read_csv('../data/text.csv')
print('number of reviews:',len(reviews))
reviews.head()

number of reviews: 18393


Unnamed: 0,reviewid,content
0,22703,"“Trip-hop” eventually became a ’90s punchline,..."
1,22721,"Eight years, five albums, and two EPs in, the ..."
2,22659,Minneapolis’ Uranium Club seem to revel in bei...
3,22661,Kleenex began with a crash. It transpired one ...
4,22725,It is impossible to consider a given release b...


That's a lot of reviews.  For efficiency's sake, let's just take a sample.

In [162]:
# drop missing rows first so the sample is guaranteed to contain 1000 reviews
reviews = reviews.dropna().sample(n=1000)
reviews.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 16891 to 3722
Data columns (total 2 columns):
reviewid    1000 non-null int64
content     1000 non-null object
dtypes: int64(1), object(1)
memory usage: 23.4+ KB


### Bag of Words Model


In [169]:
from sklearn.feature_extraction.text import CountVectorizer


counter = CountVectorizer(stop_words='english')

bag_words = counter.fit_transform(reviews.content.tolist())
print(bag_words.shape)

(1000, 37189)


In [170]:
#print(counter.vocabulary_)

That's a lot of words.  How might we reduce it?  The problem is that people most often talk about music through metaphor, which makes it difficult to choose a narrow vocabulary.  For the purpose of this analysis, though, I'm going to do a few things:
- removing proper nouns: most of these will probably correspond to artists and the titles of songs and albums
- removing numbers: while it might actually be interesting to include dates, the majority of digits in this dataset are probably trash
- removing non-English words: again, these would be nice to keep, ESPECIALLY when talking about the actual music itself (the music lexicon has a lot of non-English words)

A vocabulary restricted to the Scrabble dictionary should satisfy those three criteria pretty well.

In [171]:
with open('../data/scrabble.txt') as f:
    vocab = f.read().split('\n')
    
counter = CountVectorizer(stop_words='english', vocabulary=vocab) # an object capable of counting words in a document!
bag_words = counter.fit_transform(reviews.content.tolist())
print(bag_words.shape)

(1000, 79340)
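
The column count now equals the length of the Scrabble list rather than the number of distinct words in the reviews, because a fixed `vocabulary` gives every listed word a column whether or not it ever occurs. A toy sketch of that behavior (the two documents and the four-word vocabulary are made up for illustration, not taken from the dataset):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-vocabulary standing in for the Scrabble word list.
toy_vocab = ["guitar", "loud", "quiet", "zydeco"]

# Hypothetical documents containing a proper noun, a number, and stop words.
toy_docs = [
    "Radiohead released a quiet, quiet record in 2007.",
    "The guitar is LOUD and the drums are louder.",
]

toy_counter = CountVectorizer(stop_words="english", vocabulary=toy_vocab)
toy_counts = toy_counter.fit_transform(toy_docs)

# One column per vocabulary word, in the order the vocabulary was given.
print(toy_counts.shape)
print(toy_counts.toarray())
```

Tokens outside the vocabulary ("Radiohead", "2007", "louder") are simply ignored, and "zydeco" keeps an all-zero column even though neither document uses it.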


In [172]:
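# recent scikit-learn releases replace get_feature_names() with get_feature_names_out()
df = pd.DataFrame(data=bag_words.toarray(), columns=counter.get_feature_names_out())

# the Scrabble vocabulary (79,340 words) is larger than the one we started with (37,189),
# so drop the columns for words that never occur in the sample
df = df.loc[:, (df != 0).any(axis=0)]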
df.shape

(1000, 17584)
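
Calling `toarray()` first works at this size, but the same filter can be applied while the matrix is still sparse. A minimal sketch, assuming a SciPy CSR matrix like the one `CountVectorizer` returns; the toy matrix and term list are invented for illustration:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical term list and count matrix (2 documents x 4 terms).
toy_terms = np.array(["guitar", "loud", "quiet", "zydeco"])
toy_counts = csr_matrix(np.array([[0, 0, 2, 0],
                                  [1, 1, 0, 0]]))

# Counts are non-negative, so a column total of zero means the word never occurs.
col_totals = np.asarray(toy_counts.sum(axis=0)).ravel()
used_cols = np.flatnonzero(col_totals > 0)

# Keep only the columns (and matching term names) that are actually used.
kept_counts = toy_counts[:, used_cols]
kept_terms = toy_terms[used_cols]

print(kept_counts.shape)
print(kept_terms.tolist())
```

The result stays sparse, so it can be turned into a DataFrame afterwards only if a dense table is really needed.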

In [173]:
df

Unnamed: 0,aa,aah,aahing,aahs,aback,abandon,abandons,abates,abattoir,abba,...,zoned,zones,zoning,zonked,zoo,zoom,zooming,zowie,zucchini,zydeco
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Better: 17,584 words instead of 37,189.

### TF-IDF
We can reduce the size of the dataset further by converting the counts to TF-IDF weights and thresholding the result.

In [197]:
from sklearn.feature_extraction.text import TfidfVectorizer


tfidfer = TfidfVectorizer(stop_words='english', vocabulary=vocab)

tfidf = tfidfer.fit_transform(reviews.content.tolist())
print(tfidf.shape)

(1000, 79340)
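
To make the weights less of a black box, here is a hand-rolled sketch of the calculation on a toy corpus. It assumes what I understand to be scikit-learn's default settings (smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row); the documents are invented, and this is not the library's actual implementation:

```python
import math
from collections import Counter

# Hypothetical, already-tokenized documents.
toy_docs = [
    ["quiet", "quiet", "guitar"],
    ["loud", "guitar"],
    ["guitar", "zydeco"],
]

n_docs = len(toy_docs)
# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for doc in toy_docs for term in set(doc))


def tfidf_row(doc):
    """Return {term: weight} for one document, L2-normalized."""
    counts = Counter(doc)
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
        for term, count in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


toy_rows = [tfidf_row(doc) for doc in toy_docs]
for row in toy_rows:
    print(row)
```

"guitar" appears in every toy document, so it ends up with the smallest weight in each row, while words unique to one document are weighted up.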


In [198]:
tfidf

<1000x79340 sparse matrix of type '<class 'numpy.float64'>'
	with 212371 stored elements in Compressed Sparse Row format>
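
Almost all of that matrix is zeros. A quick back-of-the-envelope check, using the numbers from the repr above and assuming 8 bytes per `float64` entry, shows how sparse it is and what a dense copy would cost:

```python
# Shape and stored-element count copied from the sparse matrix repr.
n_rows, n_cols = 1000, 79340
stored = 212371

# Fraction of cells that hold a non-zero value.
density = stored / (n_rows * n_cols)

# Memory for a dense float64 array of the same shape (8 bytes per cell).
dense_bytes = n_rows * n_cols * 8

print(f"density: {density:.4%}")
print(f"dense size: {dense_bytes / 1e6:.1f} MB")
```

So a dense copy of the full matrix takes several hundred megabytes, which is worth keeping in mind before calling `toarray()`.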

In [201]:
# recent scikit-learn releases replace get_feature_names() with get_feature_names_out()
df_tfidf = pd.DataFrame(data=tfidf.toarray(), columns=tfidfer.get_feature_names_out())


In [202]:
df_tfidf

Unnamed: 0,aa,aah,aahed,aahing,aahs,aal,aalii,aaliis,aals,aardvark,...,zymology,zymosan,zymosans,zymoses,zymosis,zymotic,zymurgy,zyzzyva,zyzzyvas,Unnamed: 21
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [203]:
df_tfidf.max().sort_values()[-10:]

lynch       0.763375
barlow      0.770741
riley       0.794368
volta       0.797130
cash        0.798387
usher       0.798736
lungfish    0.807957
yonkers     0.810825
beep        0.813137
hood        0.829786
dtype: float64
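
The thresholding step itself could look something like the sketch below. This is only one possible reading of "thresholding" (keep a word if its highest score in any review reaches a cutoff), and the cutoff of 0.80 is a made-up value for illustration; the scores are the ten printed above.

```python
# Maximum TF-IDF score per word, copied from the output above.
max_tfidf = {
    "lynch": 0.763375,
    "barlow": 0.770741,
    "riley": 0.794368,
    "volta": 0.797130,
    "cash": 0.798387,
    "usher": 0.798736,
    "lungfish": 0.807957,
    "yonkers": 0.810825,
    "beep": 0.813137,
    "hood": 0.829786,
}

# Hypothetical cutoff; a real value would need to be tuned on the full data.
cutoff = 0.80

# Keep only the words whose best score reaches the cutoff.
kept_words = sorted(word for word, score in max_tfidf.items() if score >= cutoff)
print(kept_words)
```

On the full DataFrame the equivalent filter would be `df_tfidf.loc[:, df_tfidf.max() >= cutoff]`.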