In [1]:
# IMPORTS
import re 
import nltk
import numpy as np, pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
# Optional cleanup: stage directions that appear in the transcripts.
# Raw strings keep the escaped parentheses valid as regex patterns.
parentheticals = [ r"\(laughter\)", r"\(applause\)", r"\(music\)",
                  r"\(video\)", r"\(laughs\)", r"\(applause ends\)",
                  r"\(audio\)", r"\(singing\)", r"\(music ends\)",
                  r"\(cheers\)", r"\(cheering\)", r"\(recording\)",
                  r"\(beatboxing\)", r"\(audience\)", r"\(guitar strum\)",
                  r"\(clicks metronome\)", r"\(sighs\)", r"\(guitar\)",
                  r"\(marimba sounds\)", r"\(drum sounds\)" ]

def remove_parentheticals(text):
    # re.IGNORECASE already handles capitalization, so the text itself
    # keeps its original case instead of being lowercased.
    new_text = text
    for rgx_match in parentheticals:
        new_text = re.sub(rgx_match, ' ', new_text, flags=re.IGNORECASE)
    return new_text
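
As an alternative to looping over the patterns, they could be joined into a single alternation and compiled once. This is only a minimal sketch: the pattern list is shortened for illustration, and the names `PAREN_RE` and `strip_parentheticals` are hypothetical. Unlike the function above, it also collapses the leftover whitespace.

```python
import re

# Shortened list for illustration; the full list would work the same way.
patterns = [r"\(laughter\)", r"\(applause\)", r"\(music\)"]

# One compiled alternation instead of one re.sub call per pattern.
PAREN_RE = re.compile("|".join(patterns), flags=re.IGNORECASE)

def strip_parentheticals(text):
    """Replace known parentheticals with a space, then collapse whitespace."""
    return " ".join(PAREN_RE.sub(" ", text).split())

sample = "Thank you. (Applause) It was (laughter) great."
cleaned = strip_parentheticals(sample)
print(cleaned)
```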

In [37]:
# Load the data, which is partitioned by speaker gender:
talks_m = pd.read_csv('talks_male.csv', index_col='Talk_ID')
talks_f = pd.read_csv('talks_female.csv', index_col='Talk_ID')
talks_nog = pd.read_csv('talks_nog.csv', index_col='Talk_ID')
talks_all = pd.concat([talks_m, talks_f, talks_nog])

# And then grab only the texts of the talks:
texts = talks_all.text.tolist()

print(f"From our {talks_all.shape[0]}x{talks_all.shape[1]} CSV, \
we have a list of {len(texts)} talks.")

From our 992x14 CSV, we have a list of 992 talks.


# Tokenization

From the Terms notebook, we know that the choice of tokenizer matters, since the counts differ by method:

| METHOD         | WORD COUNT | UNIQUE TOKENS |
|----------------|------------|---------------|
| regex          | 2146       | 727           |
| nltk words     | 2576       | 725           |
| nltk wordpunct | 2686       | --            |
| scikit         | 2035       | 711           |

One key difference between the regex and NLTK tokenizers is that the former keeps contractions together, while the latter splits them into constituent, if somewhat odd, units: *can't* becomes *ca* and *n't*. For now, we will use the regex tokenizer. Later, we may find we need to switch.

For comparison, the code looks like this:

```python
# REGEX
regex = re.sub("[^a-zA-Z']", " ", texts[0]).lower().split()

# NLTK
w_tokens = [word.lower() for word in nltk.word_tokenize(texts[0])]
wp_tokens = [word.lower() for word in nltk.wordpunct_tokenize(texts[0])]
```

To make it easier, we are going to convert the regex line into a function.
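
One way such a function could look is sketched below. The name `tokenize` and the `lowercase` flag are hypothetical choices, not something fixed elsewhere in the notebook, and the character class is the one from the regex line above, so digits are dropped.

```python
import re

def tokenize(text, lowercase=True):
    """Split text on anything that is not a letter or an apostrophe."""
    cleaned = re.sub(r"[^a-zA-Z']", " ", text)
    if lowercase:
        cleaned = cleaned.lower()
    return cleaned.split()

# Contractions survive as single tokens; digits and punctuation are removed.
tokens = tokenize("I can't believe it's 2019 -- really!")
print(tokens)
```

With the talks loaded, `tokenize(texts[0])` should reproduce the `regex` list from the comparison above.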

In [26]:
corpus = nltk.Text(re.sub("[^a-zA-Z0-9']", " ", '\n'.join(texts)).split())

## Concordance

- For multi-word phrases, pass the words as a list: `text.concordance(["monstrous", "size"])`

In [27]:
corpus.concordance(["sort", "of"])

Displaying 25 of 926 matches:
t black turtleneck and jeans and he sort of did this The crowd went wild but I 
 studying hunger in Africa so I was sort of expected to know a little about the
ferent directions like this so it's sort of difficult to get an example country
y comes with what people would call sort of a rational conclusion to something 
High modernists said we will create sort of singular spaces that are generic al
ng can happen within them I call it sort of shotgun flexibility turn your head 
 design parameters And so with this sort of idea what happens is whatever actua
starts to dwarf the possibility and sort of subsume it of anything else could e
r technologies This diagram was our sort of position piece about the book and o
hat used high modernist flexibility Sort of any activity could happen anywhere 
 back to them they agreed that this sort of core concept could work We got the 
u'll see a series of five platforms sort of combs collective programs And on th
is happeni

In [28]:
# To save to a list:
you_know = corpus.concordance_list(["you", "know"])
you_know[5].line

"e are these wizards everywhere And you know there's a new version of Windows c"

In [56]:
print(f"you_know is a {type(you_know)} with {len(you_know)} items, each \
of which looks like this: \n\n{you_know[0]}")

you_know is a <class 'list'> with 25 items, each of which looks like this: 

ConcordanceLine(left=['I', 'add', 'new', 'images', 'because', 'I', 'learn', 'more', 'about', 'it', 'every', 'time', 'I', 'give', 'it', "It's", 'like', 'beach', 'combing'], query='you know', right=['Every', 'time', 'the', 'tide', 'comes', 'in', 'and', 'out', 'you', 'find', 'some', 'more', 'shells', 'Just', 'in', 'the', 'last'], offset=844, left_print=" I give it It's like beach combing", right_print='Every time the tide comes in and o', line=" I give it It's like beach combing you know Every time the tide comes in and o")


In [29]:
# To focus on the context to one side of the match (.left or .right):
corpus.concordance_list(["sort", "of"])[0].left

['away',
 'it',
 'was',
 'the',
 'MacWorld',
 'Expo',
 'he',
 'came',
 'to',
 'the',
 'stage',
 'in',
 'that',
 'black',
 'turtleneck',
 'and',
 'jeans',
 'and',
 'he']

## Similarity

In [47]:
corpus.similar("dress", num=10)  # Default number of words is 20.

and put make get day call life show give end


## Common Contexts
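
This section has no cell yet. As a placeholder, here is a minimal sketch of how `common_contexts` behaves, run on a toy token list rather than the talks (the sentence and the variable names are made up for illustration). The method prints the `left_right` contexts that all of the given words share, and `ContextIndex` exposes the same information as a frequency distribution.

```python
from nltk.text import ContextIndex, Text

# Toy token list, not the TED corpus.
toy = Text("it was a sort of idea and a kind of idea too".split())

# Prints the contexts (word before _ word after) shared by both words.
toy.common_contexts(["sort", "kind"])

# The same information as data instead of printed output.
index = ContextIndex(toy.tokens, key=lambda s: s.lower())
shared = index.common_contexts(["sort", "kind"])
```

On the full corpus, the equivalent call would be `corpus.common_contexts(["sort", "kind"])`.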

## Collocations

In [39]:
texts_f = talks_f.text.tolist()
texts_m = talks_m.text.tolist()

corp_f = nltk.Text(re.sub("[^a-zA-Z0-9']", " ", '\n'.join(texts_f)).split())
corp_m = nltk.Text(re.sub("[^a-zA-Z0-9']", " ", '\n'.join(texts_m)).split())

In [40]:
corp_f.collocations()

years ago; New York; little bit; United States; I'm going; every
single; years old; year old; skeletal muscle; Applause Thank; we're
going; every day; last year; looks like; York City; black holes; first
time; solar system; high school; I'd like


In [41]:
corp_m.collocations()

years ago; little bit; United States; I'm going; New York; we're
going; looks like; San Francisco; Chris Anderson; I'd like; year old;
human beings; last year; I've got; years old; Laughter Applause; every
day; Los Angeles; much Applause; high school
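
The two lists overlap heavily, so it helps to look at what differs. The sketch below copies the pairs by hand from the two printed outputs above and compares them as sets; the variable names are hypothetical. With the real objects, `collocation_list()` may return the pairs directly, but its return type varies across NLTK versions, so that is left as an assumption.

```python
# Collocations copied from the printed output of corp_f.collocations()
colls_f = {
    "years ago", "New York", "little bit", "United States", "I'm going",
    "every single", "years old", "year old", "skeletal muscle",
    "Applause Thank", "we're going", "every day", "last year", "looks like",
    "York City", "black holes", "first time", "solar system", "high school",
    "I'd like",
}

# Collocations copied from the printed output of corp_m.collocations()
colls_m = {
    "years ago", "little bit", "United States", "I'm going", "New York",
    "we're going", "looks like", "San Francisco", "Chris Anderson",
    "I'd like", "year old", "human beings", "last year", "I've got",
    "years old", "Laughter Applause", "every day", "Los Angeles",
    "much Applause", "high school",
}

shared = colls_f & colls_m           # in both lists
only_f = sorted(colls_f - colls_m)   # only in the women's talks
only_m = sorted(colls_m - colls_f)   # only in the men's talks

print(f"{len(shared)} shared")
print("Only in corp_f:", "; ".join(only_f))
print("Only in corp_m:", "; ".join(only_m))
```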


In [57]:
corp_f.concordance(["sort", "of"])

Displaying 25 of 145 matches:
 ours were deemed off limits to any sort of investment Many landlords believed 
eading the paper or something and I sort of presented myself to them in the doo
 no Santa Claus at all was actually sort of a relief I left the kitchen not rea
of a beautiful woman with long hair sort of lounging by some water but the Libr
 out on a mission and I guess I was sort of curious so I said Well please come 
And I said No I don't because I was sort of upset about this Lamanite story and
ve and then I can make it into some sort of melody and I'll improvise it Wow A 
'm going to try to put that in some sort of order Plays notes OK that's nice So
ighs Oh my gosh Laughter OK this is sort of technology but I can call it a 16th
ron machine and we figured out what sort of forces you needed in order to compr
ur relationship and flattening this sort of world That this woman is in England
ew me towards it There was this big sort of bumpy looking virus on the cover an
egregated 

In [58]:
corp_m.concordance(["sort", "of"])

Displaying 25 of 770 matches:
t black turtleneck and jeans and he sort of did this The crowd went wild but I 
 studying hunger in Africa so I was sort of expected to know a little about the
ferent directions like this so it's sort of difficult to get an example country
y comes with what people would call sort of a rational conclusion to something 
High modernists said we will create sort of singular spaces that are generic al
ng can happen within them I call it sort of shotgun flexibility turn your head 
 design parameters And so with this sort of idea what happens is whatever actua
starts to dwarf the possibility and sort of subsume it of anything else could e
r technologies This diagram was our sort of position piece about the book and o
hat used high modernist flexibility Sort of any activity could happen anywhere 
 back to them they agreed that this sort of core concept could work We got the 
u'll see a series of five platforms sort of combs collective programs And on th
is happeni
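
The raw match counts (145 versus 770) are only comparable if the two corpora are about the same size, which is not established here. A relative rate is a safer basis for comparison. The helper below is a sketch with a hypothetical name, `phrase_rate`, shown on toy tokens; it matches case-insensitively, as the concordance does.

```python
def phrase_rate(tokens, phrase, per=10_000):
    """Occurrences of a multi-word phrase per `per` tokens (case-insensitive)."""
    tokens = [t.lower() for t in tokens]
    phrase = [p.lower() for p in phrase]
    n = len(phrase)
    # Slide a window of len(phrase) over the tokens and count exact matches.
    hits = sum(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))
    return hits * per / len(tokens) if tokens else 0.0

# Toy example: 2 matches in 13 tokens, reported per 100 tokens.
toy = "it was sort of odd and Sort of fun but not that sort".split()
rate = phrase_rate(toy, ["sort", "of"], per=100)
print(round(rate, 2))
```

For the talks, the calls would be `phrase_rate(corp_f.tokens, ["sort", "of"])` and `phrase_rate(corp_m.tokens, ["sort", "of"])`.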