# Text Processing with Python
The goal of this workshop is to process, annotate, and analyze text so that we can
surface patterns and insights from it.

To do so, we will be working with the Python programming language, and in particular, a package called [spaCy](https://spacy.io/), which was designed to work with text-based data. With spaCy, a widely used library for natural language processing, we will extract patterns based on linguistic elements.

The dataset we will use comes from a reality TV dating show, Love Is Blind. To get the dataset, I scraped transcripts of the show and saved them to a .json file, which we will download through a URL. From this dataset, we will ask questions about the basic plot of the show, which is a romantic plot between several characters. We will also explore less obvious elements from the narrative, using spaCy to surface things we may not have noticed before.

Let's jump in!

## loading our libraries and dataset

In [None]:
# importing our spacy and pandas libraries
import spacy
import pandas as pd

# loading up the model in english
nlp = spacy.load("en_core_web_sm")

In [None]:
df = pd.read_json("https://bit.ly/lib_transcripts")

In [None]:
df.info()

In [None]:
# displaying the first five rows of the transcript_text column

df['transcript_text'][:5]

In [None]:
df['transcript_text'][1]

## exploring linguistic annotations

In [None]:
doc = nlp(df['transcript_text'][1])

In [None]:
## commenting out this line because output is too long

# doc

We will use slicing, with the same syntax as list slicing, to explore sections (slices) of our doc.

In [None]:
# how do we create slices? let's see the first 10 tokens in our doc (punctuation counts as a token).

doc[:10]

In [None]:
# practice creating slices of random bits of text

doc[:100]

In [None]:
doc[300:350]

In [None]:
doc[-100:]

In [None]:
len(doc)
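
The slices above follow the same start and stop rules as ordinary Python lists, so a plain list is a convenient place to experiment. The sketch below uses a made-up sentence (an assumption, not text from the transcripts) as a stand-in for `doc`. With spaCy, the same slices would return `Span` objects rather than lists.

```python
# A made-up stand-in for a tokenized doc; a real spaCy doc holds Token objects.
tokens = ["I", "did", "not", "expect", "to", "fall", "in", "love", "this", "fast", "."]

first_three = tokens[:3]   # start defaults to 0; the stop index is excluded
middle = tokens[3:6]       # items at positions 3, 4, and 5
last_two = tokens[-2:]     # negative indices count from the end

print(first_three)
print(middle)
print(last_two)
print(len(tokens))
```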

We use loops to iterate through our doc and pull out individual words. In this case, we are pulling out not only the word, called a `token`, but also its attributes. Read more about tokens and their attributes on the [spaCy docs](https://spacy.io/api/token#attributes).

In [None]:
# writing loops to pull out data

for i in doc[:5]:
  print(i.text)
  print(i.lemma_) # why would you want to pull out the lemma?
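
One possible answer to the question in the comment above: lemmas let us count different inflections of a word as a single item. The sketch below uses hand-written, hypothetical `(text, lemma)` pairs rather than real spaCy output, so it runs without a model.

```python
from collections import Counter

# Hypothetical (text, lemma) pairs written by hand for illustration.
# With spaCy they could be built as [(t.text, t.lemma_) for t in doc].
pairs = [
    ("loved", "love"), ("loves", "love"), ("loving", "love"),
    ("dated", "date"), ("dates", "date"),
]

# Counting surface forms treats every inflection as a different word.
surface_counts = Counter(text for text, lemma in pairs)

# Counting lemmas groups the inflections together.
lemma_counts = Counter(lemma for text, lemma in pairs)

print(surface_counts)
print(lemma_counts)
```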

In [None]:
# part of speech

for i in doc[:5]:
  print(i.text, i.pos_)

In [None]:
# other useful properties

for i in doc[:5]:
  print(i.text, i.pos_, i.shape_, i.is_alpha, i.is_stop)

In [None]:
# filtering:
# using conditional statement to print out nouns

for i in doc[:100]:
  if i.pos_ == "NOUN":
    print(i)

In [None]:
# filtering:
# using conditional statement to print out verbs

for i in doc[:100]:
  if i.pos_ == "VERB":
    print(i)
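
The two filtering loops above differ only in the part-of-speech tag, so the pattern can be wrapped in a small helper. This is a sketch, not part of spaCy: `lemmas_with_pos` is a hypothetical name, and the sample tokens are hand-made stand-ins that expose only the `lemma_` and `pos_` attributes.

```python
from collections import Counter
from types import SimpleNamespace

def lemmas_with_pos(tokens, pos):
    """Return the lowercased lemmas of tokens whose coarse POS tag equals `pos`."""
    return [t.lemma_.lower() for t in tokens if t.pos_ == pos]

# Hand-made stand-ins with the two attributes the helper reads.
sample = [
    SimpleNamespace(lemma_="pod", pos_="NOUN"),
    SimpleNamespace(lemma_="talk", pos_="VERB"),
    SimpleNamespace(lemma_="pod", pos_="NOUN"),
    SimpleNamespace(lemma_="wall", pos_="NOUN"),
]

noun_counts = Counter(lemmas_with_pos(sample, "NOUN"))
print(noun_counts.most_common(2))
```

In the notebook, the same helper could be called on real tokens, for example `lemmas_with_pos(doc[:100], "NOUN")`.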

In [None]:
# see, the doc object is quite complex, filled with annotations!

## pulling out phrases

In [None]:
# noun chunks with spacy

for chunk in doc[:20].noun_chunks:
  print(chunk)

In [None]:
for i in doc[:100].sents:
  print(i)

## pulling out entities

In [None]:
doc[:1000].ents

In [None]:
# pull out entities

for i in doc[:1000].ents:
  print(i, i.label_)

## expanding from doc to docs
Now we will expand to all 13 episodes from the second season of the show. Then, we will be able to explore what happens episode by episode.

In [None]:
# we are going to make a list of docs, one per episode, based on just the second season of the show.

docs = list(nlp.pipe(df['transcript_text'][0:13]))

In [None]:
len(docs)

The syntax in our loop is going to change. Now we need to use list indexing to indicate the *episode* we want to loop through in square brackets. After that, we can indicate the slice of words.

In [None]:
# pull out entities for the first 200 words [:200] of the last episode [-1]

for i in docs[-1][:200].ents:
  print(i, i.label_)

In [None]:
# pulling out entities for the first 200 words in the first episode

for i in docs[0][:200].ents:
  print(i, i.label_)

Another way of doing this is to add another level to our loop. This way, we can loop through all of the episodes.

In [None]:
for episode in docs:
  for word in episode.ents[:3]: # keeping only the first 3 entities of each episode, so the output stays short
    print(word)
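
When looping over every episode, it can be hard to tell which episode an entity came from. A minimal sketch, assuming we want to keep the episode number next to each entity: `first_entities` is a hypothetical helper, and the sample episodes use made-up names in place of real spaCy docs.

```python
from types import SimpleNamespace

def first_entities(docs, n=3):
    """Collect (episode_number, entity_text, entity_label) for the first n entities of each doc."""
    rows = []
    for number, episode in enumerate(docs, start=1):
        for ent in episode.ents[:n]:
            rows.append((number, ent.text, ent.label_))
    return rows

# Made-up stand-ins: each "episode" only needs an `ents` sequence.
sample_docs = [
    SimpleNamespace(ents=[SimpleNamespace(text="Alex", label_="PERSON"),
                          SimpleNamespace(text="Mexico", label_="GPE")]),
    SimpleNamespace(ents=[SimpleNamespace(text="Sam", label_="PERSON")]),
]

rows = first_entities(sample_docs, n=1)
for row in rows:
    print(row)
```

With the real data, `first_entities(docs)` would read `ent.text` and `ent.label_` from spaCy's own entity spans.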

In [None]:
# how would I pull out entities for just characters?

for episode in docs:
  for word in episode.ents[:3]:
    if word.label_ == "PERSON":
      print(word)

## who are the couples in the first 3 episodes?
Try to determine the identities of the couples that are dating each other on the show.

How might you go about doing that, using spaCy's annotation capabilities? What linguistic elements would help you to assess whether or not a character is a main character?

Go through the first 3 episodes, and determine who is dating whom in that episode.


In [None]:
for word in docs[0].ents:
  if word.label_ == "PERSON":
    print(word)
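
One possible starting point, offered as a hint rather than the answer: characters who are mentioned most often are more likely to be main characters. The sketch below counts `PERSON` mentions across episodes. `count_people` is a hypothetical helper, and the names are made up rather than taken from the show.

```python
from collections import Counter
from types import SimpleNamespace

def count_people(docs):
    """Count how often each PERSON entity text appears across all docs."""
    counts = Counter()
    for episode in docs:
        counts.update(ent.text for ent in episode.ents if ent.label_ == "PERSON")
    return counts

def make_ent(text, label):
    # Stand-in for a spaCy entity span, with only the attributes we read.
    return SimpleNamespace(text=text, label_=label)

sample_docs = [
    SimpleNamespace(ents=[make_ent("Alex", "PERSON"), make_ent("Sam", "PERSON"),
                          make_ent("Mexico", "GPE"), make_ent("Alex", "PERSON")]),
    SimpleNamespace(ents=[make_ent("Alex", "PERSON"), make_ent("Jo", "PERSON")]),
]

people = count_people(sample_docs)
print(people.most_common(3))
```

In the notebook, `count_people(docs[:3]).most_common(10)` would list the most frequently named people in the first three episodes.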

## what happens to the couples?

Now for the big question: what happens to the couples in the show? Do they stay together? Why or why not?

We will use spaCy's linguistic annotations to write a conditional statement that pulls out sentences relevant to our inquiry. We want a sentence like "person loves person", or "person married person", or "person dumped person".

First, we create a list of romantic verbs, which we will use to filter sentences. For that, we can re-deploy our conditional statement above that checks for verbs.

In [None]:
for episode in docs:
  for i in episode[:10]:
    if i.pos_ == 'VERB':
      print(i)

Then we write out the condition for filtering, which checks that a token is tagged as a verb and that its lemma is in our list.

In [None]:
romantic_verbs = ["love", "marry", "kiss", "propose", "date", "betray"]

for i in docs[0].sents:
  for token in i:
    if token.lemma_.lower() in romantic_verbs and token.pos_ == "VERB":
      print(i)

Now we will add a condition that checks if there's a `PERSON` entity in our sentence. Our loop looks a little unwieldy, but it works!

In [None]:
romantic_verbs = {"love", "marry", "kiss", "propose", "date", "betray"}

for i in docs[-1].sents:
  for token in i:
    if token.lemma_.lower() in romantic_verbs and token.pos_ == "VERB":
      for ent in i.ents:
        if ent.label_ == "PERSON":
          print(i)
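
The nested loop above prints a sentence once for every matching verb and every `PERSON` entity it contains, so some sentences may appear several times. A minimal sketch of one way to print each sentence at most once, using `any()`: `is_romantic_sentence` and `FakeSentence` are hypothetical names, and the sample sentences are hand-made stand-ins rather than transcript text.

```python
from types import SimpleNamespace

ROMANTIC_VERBS = {"love", "marry", "kiss", "propose", "date", "betray"}

def is_romantic_sentence(sent, verbs=ROMANTIC_VERBS):
    """True if the sentence has a verb whose lemma is in `verbs` and a PERSON entity."""
    has_verb = any(t.pos_ == "VERB" and t.lemma_.lower() in verbs for t in sent)
    has_person = any(ent.label_ == "PERSON" for ent in sent.ents)
    return has_verb and has_person

class FakeSentence:
    """Stand-in for a spaCy sentence span: iterable over tokens, with an `ents` list."""
    def __init__(self, tokens, ents):
        self.tokens = tokens
        self.ents = ents
    def __iter__(self):
        return iter(self.tokens)

def tok(lemma, pos):
    return SimpleNamespace(lemma_=lemma, pos_=pos)

person = SimpleNamespace(label_="PERSON")

s1 = FakeSentence([tok("Alex", "PROPN"), tok("love", "VERB"), tok("Sam", "PROPN")], [person, person])
s2 = FakeSentence([tok("I", "PRON"), tok("love", "VERB"), tok("food", "NOUN")], [])
s3 = FakeSentence([tok("Alex", "PROPN"), tok("walk", "VERB")], [person])

matches = [is_romantic_sentence(s) for s in (s1, s2, s3)]
print(matches)
```

With the real data, this could be used as `[s for s in docs[-1].sents if is_romantic_sentence(s)]`.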

Spend some time exploring the dataset, episode by episode, and try to figure out what happens with the couples.