# Text Processing with Python
The goal of this workshop is to process and annotate text so that we can analyze it and
draw insights from it.

To do so, we will be working with the Python programming language, and in particular, a package called [spaCy](https://spacy.io/), which was designed to work with text-based data. With spaCy, a widely used library for natural language processing, we will extract patterns based on linguistic elements.

The dataset we will use comes from a reality TV dating show, Love Is Blind. To get the dataset, I scraped transcripts of the show and saved them to a .json file, which we will download through a URL. From this dataset, we will ask questions about the basic plot of the show, which is a romantic plot between several characters. We will also explore less obvious elements from the narrative, using spaCy to surface things we may not have noticed before.

Let's jump in!

## loading our libraries and dataset

In [1]:
# importing our spacy and pandas libraries
import spacy
import pandas as pd

# loading up the model in english
nlp = spacy.load("en_core_web_sm")

In [2]:
df = pd.read_json("https://bit.ly/lib_transcripts")

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39 entries, 0 to 38
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   title            39 non-null     object
 1   episode_info     39 non-null     object
 2   transcript_text  39 non-null     object
 3   url              39 non-null     object
dtypes: object(4)
memory usage: 1.3+ KB


In [4]:
# displaying the first five rows of the transcript_text column

df['transcript_text'][:5]

0    My physical insecurities have definitely affec...
1    I have spent countless hours engaging with pot...
2    Fuck. Um... Okay... I said yes to Kyle... beca...
3    In Mexico, the couples continue taking steps t...
4    Time winds down in Mexico, but not before a te...
Name: transcript_text, dtype: object

In [5]:
df['transcript_text'][1]

'I have spent countless hours engaging with potential partners and, after all of that, like, I\'ve actually found my wife. It\'s just so crazy that we were able to do that, like, through pods. It\'s so fricking weird. I am completely in love with Nick. He\'s not only my fiancé, he\'s my best friend, and I\'m about to see him for the first time. My God. Hello. - You\'re beautiful. - You\'re beautiful. - I can feel your heart beating. - Really pumping there. Mine is just as much, yeah. - Seems like it\'s been a lifetime. - Yeah, it has been. - You look so pretty. - Yeah? - I was so scared. - Why were you so scared? Why do you think? I was like, "What if he doesn\'t think I\'m cute?" Of course I think you\'re cute. - My fiancé. - I can\'t believe it. - I love you. - Love you too. - You say it while I can see you. - I know. I love you. Love you, too. Being able to see the face behind the voice was just amazing. He has the most beautiful blue eyes, and I have blue eyes. Our kids are gonna b

## exploring linguistic annotations

In [6]:
doc = nlp(df['transcript_text'][1])

In [7]:
## commenting out this line because output is too long

# doc

We will use list-style slicing to explore sections (slices) of our doc. Slices are counted in tokens (words and punctuation), not characters.

In [8]:
# how do we create slices? let's see the first 10 words in our doc.

doc[:10]

I have spent countless hours engaging with potential partners and

In [9]:
# practice creating slices of random bits of text

doc[:100]

I have spent countless hours engaging with potential partners and, after all of that, like, I've actually found my wife. It's just so crazy that we were able to do that, like, through pods. It's so fricking weird. I am completely in love with Nick. He's not only my fiancé, he's my best friend, and I'm about to see him for the first time. My God. Hello. - You're beautiful. - You're beautiful. - I can

In [10]:
doc[300:350]

- I'm so awkward. - Oh my... Oh, I like the socks. - That's my dog. - I figured. Oh, he's cute. Yep. It's your dog now too. He's been looking for a mommy for

In [11]:
doc[-100:]

to think of all of my thoughts and emotions about how I feel, and when I think about this whole experience, it's like... This is exactly how I feel like we were supposed to meet. Are you okay? I'm good. Um... So... Is it Shaina? Yeah. I'm just... Would love for you to hear me out real quick. Okay. Um... Okay. I just wanted you to know that I do have deep feelings for you. What the fuck. Fuck.

In [12]:
len(doc)

8885
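
Note that `len(doc)` counts every token, punctuation included, so it overstates the number of words. Below is a minimal sketch of a word count, assuming we only care about tokens whose `is_alpha` flag is true. The helper name `count_word_tokens` is made up for this illustration and is not part of spaCy.

```python
def count_word_tokens(tokens):
    """Count tokens that consist only of alphabetic characters.

    `tokens` can be a spaCy Doc or Span (anything that yields objects
    with an `is_alpha` attribute). Hypothetical helper, not a spaCy API.
    """
    return sum(1 for token in tokens if token.is_alpha)

# In the notebook, compare the two counts (assumes `doc` from In [6]):
# count_word_tokens(doc), len(doc)
```

Keep in mind that this also skips tokens that contain digits or apostrophes, so treat the result as a rough word count.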

We use loops to iterate through our dataset and pull out individual words. In this case, we are pulling out not only the word, called `token`, but also its attributes. Read more about tokens and their attributes on the [spaCy docs](https://spacy.io/api/token#attributes).

In [13]:
# writing loops to pull out data

for i in doc[:5]:
  print(i.text)
  print(i.lemma_) # why would you want to pull out the lemma?

I
I
have
have
spent
spend
countless
countless
hours
hour
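
One answer to the question in the comment above: lemmas let us count different forms of a word together, so that "spent" and "spend", or "hours" and "hour", land in the same bucket. Here is a small sketch of that idea. The function name `lemma_counts` is an assumption for illustration; only `lemma_` and `is_alpha` come from spaCy.

```python
from collections import Counter

def lemma_counts(tokens):
    """Tally lowercase lemmas so inflected forms are counted together.

    Non-alphabetic tokens (punctuation, numbers) are skipped.
    Hypothetical helper, not a spaCy API.
    """
    return Counter(token.lemma_.lower() for token in tokens if token.is_alpha)

# In the notebook (assumes `doc` from In [6]):
# lemma_counts(doc).most_common(10)
```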


In [14]:
# part of speech

for i in doc[:5]:
  print(i.text, i.pos_)

I PRON
have AUX
spent VERB
countless ADJ
hours NOUN


In [15]:
# other useful properties

for i in doc[:5]:
  print(i.text, i.pos_, i.shape_, i.is_alpha, i.is_stop)

I PRON X True True
have AUX xxxx True True
spent VERB xxxx True False
countless ADJ xxxx True False
hours NOUN xxxx True False
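
The `is_alpha` and `is_stop` flags are handy for filtering. As a rough sketch, we could keep only the "content" words by dropping stop words such as "I" and "have" along with anything that is not alphabetic. The name `content_words` is invented for this example.

```python
def content_words(tokens):
    """Return the text of tokens that are alphabetic and not stop words.

    Hypothetical helper built on spaCy's `is_alpha` and `is_stop` flags.
    """
    return [token.text for token in tokens if token.is_alpha and not token.is_stop]

# In the notebook (assumes `doc` from In [6]):
# content_words(doc[:50])
```

Going by the flags printed above, the first five tokens would be reduced to "spent", "countless", and "hours".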


In [16]:
# filtering:
# using conditional statement to print out nouns

for i in doc[:100]:
  if i.pos_ == "NOUN":
    print(i)

hours
partners
wife
pods
love
fiancé
friend
time


In [17]:
# filtering:
# using conditional statement to print out verbs

for i in doc[:100]:
  if i.pos_ == "VERB":
    print(i)

spent
engaging
found
do
see
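
The noun and verb loops above differ only in the tag they check, so they can be folded into one helper. This is just a sketch; `words_with_pos` is a hypothetical name, and the tags are the coarse-grained `pos_` values that spaCy assigns.

```python
def words_with_pos(tokens, pos_tags):
    """Return the text of every token whose `pos_` tag is in `pos_tags`.

    `pos_tags` may be a single tag such as "NOUN" or a collection of tags.
    Hypothetical helper, not a spaCy API.
    """
    # Wrap a lone string so that "NOUN" is not split into letters.
    wanted = {pos_tags} if isinstance(pos_tags, str) else set(pos_tags)
    return [token.text for token in tokens if token.pos_ in wanted]

# In the notebook (assumes `doc` from In [6]):
# words_with_pos(doc[:100], "NOUN")
# words_with_pos(doc[:100], {"NOUN", "VERB"})
```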


In [18]:
# as you can see, the doc object is quite complex, filled with annotations!

## pulling out phrases

In [19]:
# noun chunks with spacy

for chunk in doc[:20].noun_chunks:
  print(chunk)

I
countless hours
potential partners
all
that
I


In [20]:
for i in doc[:100].sents:
  print(i)

I have spent countless hours engaging with potential partners and, after all of that, like, I've actually found my wife.
It's just so crazy that we were able to do that, like, through pods.
It's so fricking weird.
I am completely in love with Nick.
He's not only my fiancé, he's my best friend, and I'm about to see him for the first time.
My God.
Hello.
- You're beautiful.
- You're beautiful.
- I can feel your heart beating.


## pulling out entities

In [21]:
doc[:1000].ents

[Nick,
 first,
 today,
 the wedding day,
 tomorrow,
 Nick,
 second,
 Mexico,
 Yo,
 five plus minutes,
 Nah,
 Four minutes, 37 seconds,
 30 more seconds,
 homie,
 yesterday]

In [22]:
# pull out entities

for i in doc[:1000].ents:
  print(i, i.label_)

Nick GPE
first ORDINAL
today DATE
the wedding day DATE
tomorrow DATE
Nick GPE
second ORDINAL
Mexico GPE
Yo PERSON
five plus minutes TIME
Nah PERSON
Four minutes, 37 seconds TIME
30 more seconds TIME
homie PERSON
yesterday DATE
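
Notice that the small model makes mistakes here: "Nick" is labeled `GPE` (a place) and "Yo" is labeled `PERSON`. Grouping entities by label can make such errors easier to spot. A minimal sketch follows, where `entities_by_label` is a name made up for this example.

```python
from collections import defaultdict

def entities_by_label(ents):
    """Group entity texts under their label, keeping the original order.

    `ents` is any iterable of spaCy entity spans (objects with `text`
    and `label_`). Hypothetical helper, not a spaCy API.
    """
    grouped = defaultdict(list)
    for ent in ents:
        grouped[ent.label_].append(ent.text)
    return dict(grouped)

# In the notebook (assumes `doc` from In [6]):
# entities_by_label(doc[:1000].ents)
```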


## expanding from doc to docs
Now we will expand to all 13 episodes from the second season of the show. Then, we will be able to explore what happens episode by episode.

In [23]:
# we are going to make a list of docs, one per episode, for just the second season of the show.

docs = list(nlp.pipe(df['transcript_text'][0:13]))

In [24]:
len(docs)

13

The syntax in our loop is going to change. Now we need to use list indexing to indicate the *episode* we want to loop through in square brackets. After that, we can indicate the slice of words.

In [25]:
# pull out entities for the first 200 words [:200] of the last episode [-1]

for i in docs[-1][:200].ents:
  print(i, i.label_)

last night TIME
Danielle GPE
This summer DATE
this summer DATE
Shaina GPE
Shayne GPE
Natalie PERSON
Natalie PERSON
Nick ORG
Shaina LOC


In [26]:
# pulling out entities for the first 200 words in the first episode

for i in docs[0][:200].ents:
  print(i, i.label_)

Nick Lachey PERSON
Love Is Blind WORK_OF_ART


Another way of doing this is to add another level to our loop. This way, we can loop through all of the episodes.

In [27]:
for episode in docs:
  for word in episode.ents[:3]: # keeping only the first 3 entities of each episode, so the output stays short
    print(word)

Nick Lachey
Love Is Blind
Love Is Blind
Nick
first
today
Kyle
Kyle
Shayne
Mexico
first
first
Mexico
Burn Through The Dark
John Coggins
one
Kyle
Shayne
Shaina
Shayne
Natalie
Shake
Deeps
Shayne
first
another 30 years
five
Love
Blind
Nick Lachey
season
Nick and Vanessa
Vanessa Lachey
Kyle
Jarrette
First
last night
Danielle
This summer


In [28]:
# how would I pull out entities for just characters?

for episode in docs:
  for word in episode.ents[:3]:
    if word.label_ == "PERSON":
      print(word)

Nick Lachey
Kyle
Kyle
John Coggins
Kyle
Deeps
Blind
Nick Lachey
Vanessa Lachey
Kyle
Jarrette


## who are the couples in the first 3 episodes?
Try to determine the identities of the couples that are dating each other on the show.

How might you go about doing that, using spaCy's annotation capabilities? What linguistic elements would help you to assess whether or not a character is a main character?

Go through the first 3 episodes, and determine who is dating whom in that episode.


In [29]:
for word in docs[0].ents:
  if word.label_ == "PERSON":
    print(word)

Nick Lachey
Howdy Doody
Mom
Jer
Sal
Wine Danielle
J. Lo
Instagram
Latina
Shayne
Kyle
Natalie
Natalie
Dude
Beetlejuice
Beetlejuice
Shayne
Holly
Molly
Molly
Holly
Holly
Molly
Nick
Jeez
Nick
God
Kyle
Kyle
Kyle
Shayne
Shaina
Natalie
Shayne
Shayne
Natalie
ya
Ring Pop
Shaina
Natalie
Natalie
Natalie
Nick
Nick
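
One possible starting point, offered as a sketch rather than the answer: names that come up again and again are more likely to belong to main characters, so we can count `PERSON` mentions instead of reading the whole list. The function name `person_mention_counts` is hypothetical.

```python
from collections import Counter

def person_mention_counts(docs, label="PERSON"):
    """Count how often each entity text with the given label appears.

    `docs` is an iterable of spaCy Doc objects (or anything with `.ents`).
    Hypothetical helper, not a spaCy API.
    """
    counts = Counter()
    for doc in docs:
        counts.update(ent.text for ent in doc.ents if ent.label_ == label)
    return counts

# In the notebook (assumes `docs` from In [23]), for the first 3 episodes:
# person_mention_counts(docs[:3]).most_common(10)
```

Since the model sometimes files a name under another label (we saw "Nick" tagged as `GPE`) and sometimes tags non-names as `PERSON`, the counts are approximate and worth checking against the transcript.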


## what happens to the couples?

Now for the big question: what happens to the couples in the show? Do they stay together? Why or why not?

We will use spaCy's linguistic annotations to write a conditional statement that pulls out sentences relevant to our inquiry. We want a sentence like "person loves person", or "person married person", or "person dumped person".

First, we create a list of romantic verbs, which we will use to filter sentences. To find candidates, we can reuse our earlier conditional statement that checks for verbs.

In [30]:
for episode in docs:
  for i in episode[:10]:
    if i.pos_ == 'VERB':
      print(i)

affected
dating
spent
engaging
said
continue
taking
winds
bid
head
spark
face
Shake
run
takes
heartbreaking
decide
gather
vocalizing
plays
Look
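
Printing the verbs from the first ten tokens of each episode only gives a small sample. If we want a fuller picture before choosing our romantic verbs, we could tally verb lemmas across whole episodes. The sketch below assumes lowercase lemmas are good enough for this purpose; `verb_lemma_counts` is a made-up name.

```python
from collections import Counter

def verb_lemma_counts(docs):
    """Count lowercase lemmas of all tokens tagged VERB across `docs`.

    `docs` is an iterable of spaCy Doc objects (or token sequences).
    Hypothetical helper, not a spaCy API.
    """
    counts = Counter()
    for doc in docs:
        counts.update(token.lemma_.lower() for token in doc if token.pos_ == "VERB")
    return counts

# In the notebook (assumes `docs` from In [23]):
# verb_lemma_counts(docs).most_common(50)
```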


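Then we write out the condition for filtering, which checks that the token is tagged as a verb and that its lemma is in our list. Because the check runs once per token, a sentence with two matching verbs is printed twice.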

In [31]:
romantic_verbs = ["love", "marry", "kiss", "propose", "date", "betray"]

for i in docs[0].sents:
  for token in i:
    if token.lemma_.lower() in romantic_verbs and token.pos_ == "VERB":
      print(i)

My physical insecurities have definitely affected my dating life.
My dating experience has sucked.
I'm here because I'm looking for the one, the one who loves me for my personality, and not what I look like.
I don't think I ever said to anyone, "I love you," except my mom.
Do you feel like the dating world nowadays has become extremely superficial?
When I meet men on dating apps, I think they see Asian and stereotype me, thinking I'm gonna be quiet, more submissive.
- We wanna be loved for who we are.
...or will the real world sabotage that love, and will you walk away from that person forever?
"I met your dad in a social experiment, where I was dating 14 other guys."
Um, I love running, I love dancing.
Um, I love running, I love dancing.
Sundays, I love...
- I'm loving it for sure.
I love them all.
I love to punch drywall.
I love buying clothes for girls.
- I prefer dating younger.
- You've never dated an Indian girl? -
I've actually only dated white guys before.
- I love that.
I love
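
Some sentences show up twice in this output, such as "Um, I love running, I love dancing.", because the loop prints once for every matching verb. One way to get each sentence only once is to ask whether *any* token matches. This is a sketch of that variation, not a replacement for the cell above; `sentences_with_verbs` is a hypothetical name.

```python
def sentences_with_verbs(sents, verb_lemmas):
    """Return each sentence once if any VERB token's lemma is in `verb_lemmas`.

    `sents` is an iterable of spaCy sentence spans (or token sequences).
    Hypothetical helper, not a spaCy API.
    """
    wanted = set(verb_lemmas)
    matches = []
    for sent in sents:
        # any() stops at the first matching token, so no duplicates.
        if any(t.pos_ == "VERB" and t.lemma_.lower() in wanted for t in sent):
            matches.append(sent)
    return matches

# In the notebook (assumes `docs` from In [23] and `romantic_verbs` from In [31]):
# for sent in sentences_with_verbs(docs[0].sents, romantic_verbs):
#     print(sent)
```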

Now we will add a condition that checks if there's a `PERSON` entity in our sentence. Our loop looks a little unwieldy, but it works!

In [32]:
romantic_verbs = {"love", "marry", "kiss", "propose", "date", "betray"}

for i in docs[-1].sents:
  for token in i:
    if token.lemma_.lower() in romantic_verbs and token.pos_ == "VERB":
      for ent in i.ents:
        if ent.label_ == "PERSON":
          print(i)

Just that, like, when Natalie and Shayne were still dating, that you were, like, sending him flirty DMs and stuff.
Just that, like, when Natalie and Shayne were still dating, that you were, like, sending him flirty DMs and stuff.
- Tell Danielle we love her.
I do feel that way, and I do feel that way about Jessi, and I've been thinking a lot about… [sharp exhale] …like, what it would mean to propose to Jessi.
I do feel that way, and I do feel that way about Jessi, and I've been thinking a lot about… [sharp exhale] …like, what it would mean to propose to Jessi.
I really do love Jessi
[chuckles] Yanni, I love you.
At the Reunion, I told Deepti I made a huge mistake, and I didn't ask her to marry me.
[Jarrette] I didn't prioritize everything that I needed to do when we first got married, and now it's at a point where my friends will be there.
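
The nested loop prints a sentence once for every combination of matching verb and person, which is why some sentences repeat. As a tidier sketch, we could return each matching sentence once, together with the people named in it. The name `romantic_mentions` is invented for this example, and it relies on the same `PERSON` labels as above, so it inherits the model's labeling mistakes.

```python
def romantic_mentions(sents, verb_lemmas, label="PERSON"):
    """Pair each matching sentence with the people it names.

    A sentence matches when it has a VERB token whose lemma is in
    `verb_lemmas` and at least one entity with the given label.
    Returns a list of (sentence text, sorted list of names) tuples.
    Hypothetical helper, not a spaCy API.
    """
    wanted = set(verb_lemmas)
    results = []
    for sent in sents:
        has_verb = any(t.pos_ == "VERB" and t.lemma_.lower() in wanted for t in sent)
        people = sorted({ent.text for ent in sent.ents if ent.label_ == label})
        if has_verb and people:
            results.append((sent.text, people))
    return results

# In the notebook (assumes `docs` from In [23] and `romantic_verbs` from In [32]):
# for text, people in romantic_mentions(docs[-1].sents, romantic_verbs):
#     print(people, "|", text)
```

Seeing the names next to each sentence may make it easier to track who is connected to whom as you go episode by episode.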


Spend some time exploring the dataset, episode by episode, and try to figure out what happens with the couples.