# One Hit Wonders

## Summary

After loading the data, we use the modified vectorizing protocol (though it is not strictly necessary) to create a term matrix that admits only words appearing in a single document.

After converting the matrix into a dataframe, we sum its columns into a series, filter the series for words that occur more than once within their one document, and then explore the context in which those terms appear.

## Imports and Data

In [1]:
# Imports
import pandas as pd, re, nltk
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
# Load the Data
df = pd.read_csv('../output/TEDall_speakers.csv')
df.shape

urls  = df.public_url.tolist()
texts = df.text.tolist()

In [3]:
# Raw strings keep the escaped parentheses intact for the regex engine
parentheticals = [ r"\(laughter\)", r"\(applause\)", r"\(music\)", r"\(video\)", 
                  r"\(laughs\)", r"\(applause ends\)", r"\(audio\)", r"\(singing\)", 
                  r"\(music ends\)", r"\(cheers\)", r"\(cheering\)", r"\(recording\)", 
                  r"\(beatboxing\)", r"\(audience\)", r"\(guitar strum\)", 
                  r"\(clicks metronome\)", r"\(sighs\)", r"\(guitar\)", r"\(marimba sounds\)", 
                  r"\(drum sounds\)" ]


def clean_parens(text):
    new_text = text
    for rgx_match in parentheticals:
        new_text = re.sub(rgx_match, ' ', new_text.lower(), flags=re.IGNORECASE)
    return new_text

## A Very Peculiar Vector: `max_df = 1`

In [4]:
vec = CountVectorizer(preprocessor = clean_parens, max_df = 1)
X = vec.fit_transform(texts)
X.shape

(1747, 21037)
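
What makes `max_df = 1` peculiar is that scikit-learn reads an integer as an absolute document count, while a float is read as a proportion of documents. The sketch below shows the difference on a made-up three-document corpus; `toy_docs` and the variable names are illustrative assumptions, not part of the TED data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical, not the TED transcripts)
toy_docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "telomeres telomeres shorten with age",
]

# Integer max_df=1: keep only terms found in at most one document
vec_int = CountVectorizer(max_df=1)
X_int = vec_int.fit_transform(toy_docs)
one_doc_terms = sorted(vec_int.vocabulary_)

# Float max_df=1.0: keep terms found in at most 100% of documents, i.e. all of them
vec_float = CountVectorizer(max_df=1.0)
vec_float.fit(toy_docs)
all_terms = sorted(vec_float.vocabulary_)

print(one_doc_terms)  # "the" and "cat" are dropped: each occurs in two documents
print(all_terms)
```

Note that a term kept by `max_df=1` can still have a count greater than one, as "telomeres" does here, because the limit applies to the number of documents and not to the number of occurrences.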

In [5]:
# get_feature_names() was removed in scikit-learn 1.2; older versions can use it instead
dfx = pd.DataFrame(X.toarray(), columns = vec.get_feature_names_out())

# Create a series of just the terms and the number of times they occur in total
sums = dfx.sum()

# Save to CSV
#sums.to_csv('../output/one_doc_wonders.csv')

`sums` is a pandas series. You can think of a series as a one-column dataframe: it has an index, here `sums.index`, and a column of contents, here `sums.values`. You can sort by either:

In [6]:
sums = sums.sort_values(ascending = False)
sums.head()

bf           75
gk           46
telomeres    40
abed         39
mzuri        36
dtype: int64
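
As a small illustration of sorting by either the index or the values, here is a sketch on a toy series. The counts are invented for the example and are not taken from the TED data.

```python
import pandas as pd

# Toy term counts (hypothetical values)
toy = pd.Series({"fonio": 5, "abed": 3, "indus": 1})

by_count = toy.sort_values(ascending=False)  # order by the contents
by_term = toy.sort_index()                   # order by the index labels

print(by_count.index.tolist())
print(by_term.index.tolist())
```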

## Words That Appear More Than Once But Only in One Document

Filtering a series is just like filtering a dataframe. Here we create a new series holding only the repeated terms, which can then be saved to a CSV.

In [7]:
repeated = sums[sums > 1]
len(repeated)

3895
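
The filter works through a boolean mask: the comparison produces a series of `True`/`False` values, and indexing with it keeps only the `True` rows. A minimal sketch on invented counts, with the dataframe equivalent alongside for comparison (all names here are hypothetical):

```python
import pandas as pd

# Toy totals (hypothetical values)
toy_sums = pd.Series({"bf": 4, "mat": 1, "fonio": 2, "sat": 1})

# Series: the comparison yields a boolean mask, which then selects rows
mask = toy_sums > 1
toy_repeated = toy_sums[mask]

# DataFrame: the same pattern, with the mask built from one column
toy_frame = toy_sums.rename("count").to_frame()
frame_repeated = toy_frame[toy_frame["count"] > 1]

print(toy_repeated)
print(frame_repeated)
```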

In [8]:
# Write repeated to CSV:
# repeated.to_csv('../output/one_doc_repeated.csv')

In [9]:
repeated.head(10)

bf           75
gk           46
telomeres    40
abed         39
mzuri        36
jf           36
indus        33
fonio        32
teszler      30
tkm          30
dtype: int64

It's time to see these terms in context. For now, we will convert the list of texts into a single NLTK text, but after that we will want to see in which texts these words occur.

In [10]:
onetext = nltk.Text(re.sub("[^a-zA-Z0-9']"," ",'\n'.join(texts)).split())

In [11]:
onetext.concordance("tkm")

Displaying 25 of 30 matches:
w Laughter Siphumeze Khundayi Hi guys TKM Hello everyone TKM So you think you k
e Khundayi Hi guys TKM Hello everyone TKM So you think you know about sex Chanc
e going to keep things safe and spicy TKM So now the act of rubbing our naked b
are going to live our best sexy lives TKM And we're going to tell you how to ha
out the things that we need to change TKM And the things we need to embrace in 
 From the top of my head rape culture TKM How tradition and culture limit ideas
hen he's trying to turn up the volume TKM Like that is a personal pet peeve of 
ve of mine SK We are so scared of sex TKM And we need somebody to blame for our
 see how well that goes down Laughter TKM Does not go down well I once challeng
at you saw on the internet by mistake TKM Mhm So now in order to cure this ailm
ach us to help us upgrade the present TKM So now if I had a glass of Merlot whi
t be named SK Whispering Colonization TKM Came through Within African societies
es of old w

### Initial Observations

* Unexplained (artifacts of some kind?): bf, gk, jf, tkm.
* Yup, used: telomeres (biological), Indus, fonio (some kind of grain).
* Names: Abed, Teszler.
* Song lyric: **mzuri** (in a transliterated language).
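
One possible way to follow up on which text each term comes from: because every column of the term matrix is non-zero in exactly one row, the row holding a column's maximum identifies the document. The sketch below uses made-up stand-ins for `dfx` and `urls`; `source_of`, `toy_dfx`, and `toy_urls` are hypothetical names, and it assumes the dataframe keeps its default integer index so that rows line up with the list of URLs.

```python
import pandas as pd

# Hypothetical stand-ins for `dfx` (term counts) and `urls` (one per document)
toy_urls = ["talk_a", "talk_b", "talk_c"]
toy_dfx = pd.DataFrame(
    {"bf": [0, 4, 0], "fonio": [2, 0, 0], "mat": [0, 0, 1]}
)

def source_of(term, counts, labels):
    """Return the label of the single document in which `term` occurs."""
    row = counts[term].idxmax()  # the only non-zero row when max_df = 1
    return labels[row]

sources = {term: source_of(term, toy_dfx, toy_urls) for term in ["bf", "fonio"]}
print(sources)
```

With the real data, the same lookup against `dfx` and `urls` would point to the talk behind each unexplained term.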