# Section 9. Text Analysis Practice

#### Instructor: Pierre Biscaye

The purpose of this notebook is to give you opportunities and challenges to practice applying the skills developed in the other notebooks. 

The content of this notebook is taken from UC Berkeley D-Lab's Python Text Analysis [course](https://github.com/dlab-berkeley/Python-Text-Analysis).


In [None]:
import pandas as pd
import os
import re
import nltk
import spacy
import numpy as np
import matplotlib.pyplot as plt

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from string import punctuation
%matplotlib inline

## Challenge 1: Extracting and Counting Substrings

Adapting the code that extracts Twitter handle mentions from the Twitter data, write code to extract all hashtags. Keep the results as lists. Then, using this information, count how many times each hashtag is mentioned across all tweets. Plot a bar chart of the counts for the 10 most common hashtags. 
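
As a starting point, here is a minimal sketch of the extract-then-count idea on a few made-up tweets. The pattern `#\w+`, the toy tweets, and the names `hashtag_pattern`, `toy_tweets`, and `hashtag_counts` are assumptions for illustration, not a reference solution. Hashtags are lowercased here so that `#Delayed` and `#delayed` are counted together, which you may or may not want.

```python
import re
from collections import Counter

# Assumed pattern: a '#' followed by one or more word characters
hashtag_pattern = r'#\w+'

# Made-up tweets, standing in for the real data
toy_tweets = [
    'Stuck at the gate again #delayed #fail',
    'Great crew today! #thanks',
    'Two hours late... #Delayed',
]

# One list of hashtags per tweet (an empty list if there are none)
hashtags = [re.findall(hashtag_pattern, tweet.lower()) for tweet in toy_tweets]

# Flatten the lists and count each hashtag across all tweets
hashtag_counts = Counter(tag for tags in hashtags for tag in tags)
print(hashtag_counts.most_common(10))
```

With the real data, the same pattern could be applied to each row with something like `tweets['text'].str.findall(hashtag_pattern)`, and the 10 largest counts passed to a bar plot.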

In [None]:
# Specify the separator to be comma
tweets = pd.read_csv('Data/airline_tweets.csv', sep=',')

# Your code

## Challenge 2: Preprocessing with Multiple Steps

So far we've learned a few preprocessing operations. Let's put them together in a function! This function will be handy to refer to if you work with messy English text data and want to preprocess it in a single step. 

The below code reads in example text data for this challenge. Write a function to:
- Lowercase the text
- Remove punctuation marks
- Remove extra whitespace characters

Feel free to recycle the code we used in the other notebook!
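
One compact way to chain the three steps is sketched below, using `str.translate` to drop punctuation and `split`/`join` to collapse whitespace. The name `clean_text_compact` is hypothetical, and this is only one of several reasonable approaches; it is not necessarily the approach used in the other notebook.

```python
from string import punctuation

# Translation table that deletes every character in string.punctuation
punct_table = str.maketrans('', '', punctuation)

def clean_text_compact(text):
    '''Lowercase, strip punctuation, and collapse whitespace.'''
    text = text.lower()
    text = text.translate(punct_table)
    # split() with no arguments splits on any run of whitespace
    # and drops leading/trailing whitespace
    return ' '.join(text.split())

print(clean_text_compact('  Hello,   World!!\nThis is   a TEST.  '))
# hello world this is a test
```

Note that removing punctuation this way also merges contractions and hyphenated words (for example, `don't` becomes `dont`), which may or may not suit your data.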

In [None]:
challenge1_path = 'Data/example1.txt'

with open(challenge1_path, 'r') as file:
    challenge1 = file.read()
    
print(challenge1)

In [None]:
from string import punctuation

def remove_punct(text):
    '''Remove punctuation marks in input text'''
    
    # Select characters not in punctuation
    no_punct = []
    for char in text:
        if char not in punctuation:
            no_punct.append(char)

    # Join the characters into a string
    text_no_punct = ''.join(no_punct)   
    
    return text_no_punct

In [None]:
# Write a pattern in regex
blankspace_pattern = r'\s+'

# Write a replacement for the pattern identified
blankspace_repl = ' '

def clean_text(text):

    # Step 1: Lowercase the input text
    text = text.lower()

    # Step 2: Use remove_punct to remove punctuation marks
    text = remove_punct(text)

    # Step 3: Remove extra whitespace characters
    text = re.sub(blankspace_pattern, blankspace_repl, text)
    text = text.strip()
    
    return text

In [None]:
clean_text(challenge1)

## Challenge 3: Remove Stop Words

We have seen how `nltk` and `spaCy` work as NLP packages. We've also demonstrated how to identify stop words with each package. 

Let's write **two** functions to remove stop words from our text data. 

- Complete the function for stop words removal using `nltk`
    - The starter code requires two arguments: the raw text input and a list of predefined stop words
- Complete the function for stop words removal using `spaCy`
    - The starter code requires one argument: the raw text input
 
A little reminder before we dive in: both functions take raw text as input, so that's a signal to perform tokenization on the raw text first!
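
One detail worth keeping in mind when filtering against a list: the membership test is case-sensitive, and NLTK's English stop words are all lowercase, so a capitalized token such as `The` is not removed unless tokens are lowercased before the comparison. The sketch below illustrates this with a tiny hand-written stop word set and whitespace tokenization; `toy_stopwords` and `remove_stopword_lowercase` are hypothetical names used only for this illustration.

```python
# Tiny stand-in stop word set (an assumption; NLTK's list is much longer)
toy_stopwords = {'the', 'is', 'a', 'to', 'and'}

def remove_stopword_lowercase(tokens, stopword):
    '''Drop tokens whose lowercased form is in the stop word collection.'''
    stopword = set(stopword)  # set membership is faster than list membership
    return [token for token in tokens if token.lower() not in stopword]

tokens = 'The flight is delayed and the gate changed'.split()

# Case-sensitive filter: 'The' survives
print([token for token in tokens if token not in toy_stopwords])
# ['The', 'flight', 'delayed', 'gate', 'changed']

# Lowercased comparison: 'The' is removed as well
print(remove_stopword_lowercase(tokens, toy_stopwords))
# ['flight', 'delayed', 'gate', 'changed']
```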

In [None]:
stop = stopwords.words('english')

def remove_stopword_nltk(raw_text, stopword):
    
    # Step 1: Tokenization with nltk
    tokens = word_tokenize(raw_text)
    
    # Step 2: Filter out tokens in the stop word list
    text = [token for token in tokens if token not in stopword]
    
    return text

In [None]:
nlp = spacy.load('en_core_web_sm')

def remove_stopword_spacy(raw_text):

    # Step 1: Apply the nlp pipeline
    doc = nlp(raw_text)
    
    # Step 2: Filter out tokens that spaCy flags as stop words
    text = [token.text for token in doc if not token.is_stop]

    return text

In [None]:
text = tweets['text'][7]

In [None]:
remove_stopword_nltk(text, stop)

In [None]:
remove_stopword_spacy(text)

## Challenge 4: Find the Word Boundary

Now we know that tokenization in BERT often returns subwords. Let's try a few more examples! 

Do the results make sense to you? What do you think is the correct word boundary to split the following words into subwords? 

Also feel free to read more about the limitations of the WordPiece algorithm. For instance, [this blog post](https://medium.com/@rickbattle/weaknesses-of-wordpiece-tokenization-eb20e37fec99) dives into the reasons why it fails, and [this one](https://tinkerd.net/blog/machine-learning/bert-tokenization/#demo-bert-tokenizer) introduces the mechanism underlying the algorithm. 
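
To build some intuition for where the word boundaries come from, here is a toy sketch of the greedy longest-match-first lookup that WordPiece tokenizers are usually described as using at tokenization time. The function `toy_wordpiece` and the vocabulary `toy_vocab` are made up for illustration; BERT's real vocabulary has roughly 30,000 entries, so its splits for the same words may differ.

```python
def toy_wordpiece(word, vocab, unk_token='[UNK]'):
    '''Greedily split a word into the longest pieces found in vocab.'''
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the '##' prefix
                candidate = '##' + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches: the whole word becomes unknown
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical vocabulary, not BERT's
toy_vocab = {'un', 'stop', '##stop', '##pable', '##s'}

print(toy_wordpiece('unstoppable', toy_vocab))
# ['un', '##stop', '##pable']
print(toy_wordpiece('xyz', toy_vocab))
# ['[UNK]']
```

Because the match is greedy and purely string-based, the pieces depend on what happens to be in the vocabulary, not on the morphology of the word.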

In [None]:
from transformers import BertTokenizer
# Initialize the tokenizer 
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def get_tokens(string):
    '''Tokenize the input string with BERT'''
    tokens = tokenizer.tokenize(string)
    print(tokens)

In [None]:
# Hyphenated proper noun
get_tokens('Clermont-Ferrand')

# Prefix
get_tokens('unstoppable')

# Digits
get_tokens('378')

# YOUR EXAMPLE

## Challenge 5: Words with Highest Mean TF-IDF scores

In notebook 9b, we got tf-idf values for each term in each document. Do these values tell us anything about our data? Instead of focusing on the tf-idf value of any particular word, let's take a step back. Are there any words that are particularly informative for tweets that have been classified as positive/negative? 

Let's gather the indices of all positive/negative tweets, and calculate the mean tf-idf scores of the words that appear in positive/negative tweets. 

We've provided the following starter code as scaffolding:
- Use boolean masks to select tweets that have positive/negative sentiments, retrieve the indices, and assign them to `positive_index`/`negative_index`
- Select positive/negative tweets in the tfidf dataframe, take the mean tf-idf values across the documents, sort the mean values in descending order, and get the top 10 terms. 

After you've completed the following two cells, you can plot the words having the highest mean tf-idf scores for each subset. 
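
If the chain of operations feels abstract, the following toy example walks through the same idea on a hand-made table with four documents and three terms. The values in `toy_tfidf` and `toy_labels` are invented for illustration and are not taken from the airline data.

```python
import pandas as pd

# Invented tf-idf values: one row per document, one column per term
toy_tfidf = pd.DataFrame({
    'delay':  [0.0, 0.8, 0.6, 0.0],
    'thanks': [0.9, 0.0, 0.0, 0.7],
    'flight': [0.4, 0.5, 0.3, 0.2],
})

# Invented sentiment labels, aligned with the rows above
toy_labels = pd.Series(['positive', 'negative', 'negative', 'positive'])

# Boolean mask -> index of the positive documents
positive_rows = toy_labels[toy_labels == 'positive'].index

# Mean tf-idf per term over those documents, largest first
top_positive = (toy_tfidf.loc[positive_rows]
                .mean()
                .sort_values(ascending=False)
                .head(2))
print(top_positive)
```

Here `thanks` comes out on top for the positive rows, followed by `flight`, while `delay` averages to zero because it never appears in a positive document.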

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tweets = pd.read_csv('Data/tweets_clean.csv', sep=',')

# Create a tfidf vectorizer
vectorizer = TfidfVectorizer(lowercase=True,
                             stop_words='english',
                             min_df=2,
                             max_df=0.95,
                             max_features=None)

# Fit and transform 
tf_dtm = vectorizer.fit_transform(tweets['text_lemmatized'])

# Create a tf-idf dataframe
tfidf = pd.DataFrame(tf_dtm.todense(),
                     columns=vectorizer.get_feature_names_out(),
                     index=tweets.index)


In [None]:
# Complete the boolean masks 
positive_index = tweets[tweets['airline_sentiment'] == 'positive'].index
negative_index = tweets[tweets['airline_sentiment'] == 'negative'].index

In [None]:
# Complete the following two lines
pos = tfidf.loc[positive_index].mean().sort_values(ascending=False).head(10)
neg = tfidf.loc[negative_index].mean().sort_values(ascending=False).head(10)

In [None]:
pos.plot(kind='barh', 
         xlim=(0, 0.18),
         color='cornflowerblue',
         title='Top 10 terms with the highest mean tf-idf values for positive tweets');

In [None]:
neg.plot(kind='barh', 
         xlim=(0, 0.18),
         color='darksalmon',
         title='Top 10 terms with the highest mean tf-idf values for negative tweets');

How do you interpret these two plots? Are there any words that don't really make sense to you? Do the results suggest a need for any additional preprocessing?

## Challenge 6: Doesn't Match

We have a list of tuples for coffee-noun pairs. Let's find out which coffee drink is most closely associated with the word "coffee," and which one is least. Complete the for loop to calculate the cosine similarity between each pair.
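
As a reminder of what is being computed, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below implements it in plain Python on made-up 2-dimensional vectors; the numbers in `toy_vectors` are invented and have nothing to do with the real 300-dimensional embeddings.

```python
import math

def cosine_similarity(u, v):
    '''Dot product of u and v divided by the product of their norms.'''
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Invented vectors for illustration only
toy_vectors = {
    'coffee':   [0.9, 0.4],
    'espresso': [0.8, 0.5],
    'irish':    [0.1, 0.9],
}

for word in ['espresso', 'irish']:
    score = cosine_similarity(toy_vectors['coffee'], toy_vectors[word])
    print(f"coffee, {word}, {score:.3f}")
```

Vectors pointing in nearly the same direction score close to 1, and orthogonal vectors score 0, regardless of their lengths.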

In [None]:
import gensim
import gensim.downloader as api
from gensim.models import KeyedVectors
wv = KeyedVectors.load_word2vec_format('Data/GoogleNews-vectors-negative300.bin', binary=True)


In [None]:
coffee_nouns = [
    ('coffee', 'espresso'),
    ('coffee', 'cappuccino'),
    ('coffee', 'latte'),
    ('coffee', 'americano'),
    ('coffee', 'irish'),
]

In [None]:
# Get cosine similarities between each pair
for w1, w2 in coffee_nouns:
    similarity = wv.similarity(w1, w2)
    print(f"{w1}, {w2}, {similarity}")

Next, look up the documentation for the [`doesnt_match`](https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html#word2vec-demo) function. We will use it to identify the verb in the following list (one cell below) that does not seem to belong.

Use `doesnt_match` to find the verb that is unlikely to fit within the group.
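
Conceptually, `doesnt_match` can be thought of as follows: normalize each word vector to unit length, average them, and return the word whose vector is least similar to that average. The sketch below mimics this on invented 2-dimensional vectors. It is an approximation for intuition, not gensim's actual code, and `odd_one_out` and `toy_vectors` are hypothetical names.

```python
import math

def unit(v):
    '''Scale a vector to length 1.'''
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def odd_one_out(vectors):
    '''Return the word least similar to the mean of all unit vectors.'''
    normed = {word: unit(vec) for word, vec in vectors.items()}
    dim = len(next(iter(normed.values())))
    # Average the unit vectors, then normalize the average
    mean = unit([sum(vec[i] for vec in normed.values()) / len(normed)
                 for i in range(dim)])
    # Dot product of unit vectors = cosine similarity to the mean
    scores = {word: sum(a * b for a, b in zip(vec, mean))
              for word, vec in normed.items()}
    return min(scores, key=scores.get)

# Invented vectors: three point one way, one points another way
toy_vectors = {
    'brew': [0.90, 0.10],
    'pour': [0.80, 0.20],
    'drip': [0.85, 0.15],
    'make': [0.10, 0.90],
}

print(odd_one_out(toy_vectors))
# make
```

The result for the toy vectors says nothing about what the real model will return for `coffee_verbs`; that is for you to find out.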

In [None]:
coffee_verbs = ['brew', 'drip', 'pour', 'make', 'grind', 'roast']

In [None]:
# Find the word that doesn't belong to the list
verb_doesnt_match = wv.doesnt_match(coffee_verbs)
verb_doesnt_match

## Challenge 7: Gender Bias in Word Embeddings

[Bolukbasi et al. (2016)](https://arxiv.org/pdf/1607.06520) is a thought-provoking investigation of gender bias in word embeddings. They primarily focus on word analogies, especially those that reveal gender stereotyping. Let's run a couple of examples discussed in the paper, using the `most_similar` function we've just learned. 

The following code block contains a few examples we can pass to the `positive` argument: we want the output to be similar to, for example, `woman` and `chairman`, and at the same time, we specify that it should be dissimilar to `man`. We'll print the top result by indexing to the 0th item. 
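
To see the arithmetic behind such analogy queries, the sketch below adds the unit vectors of the positive words, subtracts those of the negative words, and returns the remaining word closest to the result by cosine similarity, skipping the input words themselves. It is a simplified illustration on invented 3-dimensional vectors, not gensim's implementation; `toy_analogy` and `toy_vectors` are hypothetical names.

```python
import math

def unit(v):
    '''Scale a vector to length 1.'''
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def toy_analogy(vectors, positive, negative):
    '''Closest word to sum(positive) - sum(negative), excluding the inputs.'''
    normed = {word: unit(vec) for word, vec in vectors.items()}
    dim = len(next(iter(normed.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, normed[word])]
    for word in negative:
        query = [q - x for q, x in zip(query, normed[word])]
    query = unit(query)
    # Cosine similarity of every other word to the query vector
    scores = {word: sum(q * x for q, x in zip(query, vec))
              for word, vec in normed.items()
              if word not in positive and word not in negative}
    return max(scores, key=scores.get)

# Invented vectors: axis 0 ~ "male", axis 1 ~ "female", axis 2 ~ "royalty"
toy_vectors = {
    'man':   [1.0, 0.0, 0.0],
    'woman': [0.0, 1.0, 0.0],
    'king':  [1.0, 0.0, 1.0],
    'queen': [0.0, 1.0, 1.0],
    'apple': [0.1, 0.1, -1.0],
}

print(toy_analogy(toy_vectors, positive=['woman', 'king'], negative=['man']))
# queen
```

In the toy space the gender direction was built in by hand. In real embeddings, any such direction is learned from the training text, which is how stereotypes in that text can surface in analogy results.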

Let's complete the following for loop.

In [None]:
positive_pair = [['woman', 'chairman'],
                 ['woman', 'doctor'], 
                 ['woman', 'computer_programmer'],
                 ['woman', 'pilot']]
negative_word = 'man'

In [None]:
# Get the most similar word given positive and negative examples
for example in positive_pair:
    # Pass the negative word as a list so the call behaves the same across gensim versions
    result = wv.most_similar(positive=example, negative=[negative_word])
    print(f"man is to {example[1]} as woman is to {result[0][0]}")

**Question**: What do you find? Are these results surprising?