# Word Shift Graphs — Comparing Cities

In this notebook, we will be comparing how people talk about Seattle vs. other cities — Portland, Chicago, LA, or Beijing — on Reddit. To do so, we will be using the `Shifterator` package, and we will be making word shift graphs.

What we need to make word shift graphs with `Shifterator` is **two dictionaries for our two corpora** — dictionaries that have keys as words and values as how many times the words show up in the corpora. To make these dictionaries, we will go over how to use `Counter()` and how to transform Pandas DataFrame columns into dictionaries of word counts.

## Update a Counter

In [2]:
from collections import Counter

In [3]:
fruits = ["apple", "apple", "banana", "kiwi", 'kiwi', "kiwi", 'banana']

By using `Counter`, we can count the items in a collection, such as a list, and get back a dictionary-like object that maps each item to its count.

In [5]:
Counter(fruits)

Counter({'apple': 2, 'banana': 2, 'kiwi': 3})

We can also add items to a Counter with the `update()` method.

In [6]:
# Make an empty counter
fruit_counter = Counter()

In [7]:
# Add to it!
fruit_counter.update(fruits)

In [8]:
fruit_counter

Counter({'apple': 2, 'banana': 2, 'kiwi': 3})

We can also add a couple of oranges and a banana to our counter.

In [33]:
fruit_counter.update(['orange', 'orange', 'banana'])

In [34]:
fruit_counter

Counter({'apple': 2, 'banana': 3, 'kiwi': 3, 'orange': 2})

To see only the most frequent items, we can use the `most_common()` method, which returns a list of `(item, count)` pairs sorted from the highest count to the lowest.

In [36]:
fruit_counter.most_common(2)

[('banana', 3), ('kiwi', 3)]

## Make a Word Tokenizer

Make sure to run all of these cells and create these variables. If you're unsure, check your Variable Inspector.

In [20]:
from collections import Counter
import re

In [22]:
stopwords = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours',
 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers',
 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves',
 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are',
 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does',
 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until',
 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down',
 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here',
 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',
 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so',
 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now', 've', 'll', 'amp']

Complete the code below to create a function that tokenizes any text and returns a list of words.

In [23]:
def tokenize_words(text):
    lowercase_text = text.lower()
    split_words = re.split(r"\W+", lowercase_text)
    # Your code here
    return meaningful_words

In [24]:
tokenize_words("I love Seattle!!!")

['love', 'seattle', '']
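
One possible way to complete the function is sketched below. It assumes the missing step is a stopword filter, which is consistent with the output above but is not confirmed by the notebook. The names `demo_stopwords`, `tokenize_words_sketch`, and `drop_empty` are hypothetical, and the stopword list is abbreviated for illustration.

```python
import re

# Abbreviated, hypothetical stopword list; the notebook's full list is longer.
demo_stopwords = ['i', 'me', 'my', 'the', 'a', 'an', 'and', 'is', 'in', 'to']

def tokenize_words_sketch(text, drop_empty=False):
    # Lowercase the text so that "Seattle" and "seattle" count as the same word
    lowercase_text = text.lower()
    # Split on runs of non-word characters (spaces, punctuation)
    split_words = re.split(r"\W+", lowercase_text)
    # Assumed missing step: keep only the words that are not stopwords
    meaningful_words = [word for word in split_words if word not in demo_stopwords]
    # Optional extra step (not in the notebook): drop empty strings left by re.split
    if drop_empty:
        meaningful_words = [word for word in meaningful_words if word != ""]
    return meaningful_words

print(tokenize_words_sketch("I love Seattle!!!"))
print(tokenize_words_sketch("I love Seattle!!!", drop_empty=True))
```

The trailing `''` appears because `re.split` returns an empty string when the text ends with a separator such as `!!!`. Unless those empty strings are filtered out, they are counted like any other word later on.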

## Seattle vs. Portland

In [12]:
import pandas as pd
pd.options.display.max_colwidth = 400

In [13]:
seattle_df = pd.read_csv('../reddit-data/Seattle-Reddit-Submissions-1k.csv', parse_dates=['date'])

Examine 10 random posts

In [None]:
# Your code here

In [14]:
pdx_df = pd.read_csv('../reddit-data/Portland-Reddit-Submissions-1k.csv', parse_dates=['date'])

Examine 10 random posts

In [None]:
# Your code here

## Transform a Series Into a List

We can transform a DataFrame column (a pandas Series) into a plain Python list with `.to_list()`.

In [None]:
seattle_df['title'].to_list()

In [15]:
seattle_posts = seattle_df['title'].to_list()

In [16]:
pdx_posts = pdx_df['title'].to_list()

Let's loop through `seattle_posts`, tokenize each post, and print the list of tokenized words.

In [47]:
# Your code here
    print(tokenize_words(text))

['seattle', 'votes', 'decriminalize', 'psilocybin', 'similar', 'substances']
['diamantenhände', 'seattle', 'german', 'market', 'open', '']
['anyone', 'else', 'enjoy', 'comparing', 'real', 'seattle', 'game', 'pioneer', 'place', '']
['ace', 'hardware', 'lake', 'city', 'seattle', 'enforces', 'mask', 'policy']
['diamantenhände', 'seattle', 'german', 'market', 'open', '']
['fight', 'mask', 'mandate', 'seattle', 'ace', 'hardware', '']
['ace', 'hardware', 'employee', 'seattle', 'wa', 'tries', 'enforce', 'mask', 'mandate', 'bat']
['', 'arnold', 'seattle', 'seahawks', 'jamal', 'adams', 'expected', 'participate', 'mandatory', 'minicamp', 'looks', 'sign', 'contract', 'extension']
['diamantenhände', 'seattle', 'german', 'market', 'open', '']
['eagle', 'taking', 'walk', 'rain', 'low', 'tide', 'west', 'seattle', '']
['new', '315', 'ship', 'shore', 'cranes', 'inbound', 'seattle', 'yesterday', 'aboard', 'zhen', 'hua', '36', 'cranes', 'welded', 'deck', 'ship', 'former', 'tanker', '50', 'ballast', 'tank

Let's make an empty Counter, loop through the Reddit posts, and add the tokenized words to the Counter for each post.

In [25]:
seattle_counts = Counter()

for post in seattle_posts:
    words = tokenize_words(post)
    seattle_counts.update(words)

In [56]:
pdx_counts = # Your code here

for post in pdx_posts:
    # Your code here — tokenize the words and assign the resulting list to a variable
    pdx_counts.update(words)

Check to see if you made the dictionaries correctly:

In [26]:
seattle_counts

Counter({'seattle': 1551,
         'votes': 7,
         'decriminalize': 3,
         'psilocybin': 1,
         'similar': 1,
         'substances': 1,
         'diamantenhände': 4,
         'german': 4,
         'market': 10,
         'open': 12,
         '': 1170,
         'anyone': 8,
         'else': 6,
         'enjoy': 9,
         'comparing': 1,
         'real': 16,
         'game': 78,
         'pioneer': 2,
         'place': 13,
         'ace': 5,
         'hardware': 3,
         'lake': 7,
         'city': 76,
         'enforces': 1,
         'mask': 3,
         'policy': 3,
         'fight': 5,
         'mandate': 2,
         'employee': 3,
         'wa': 93,
         'tries': 1,
         'enforce': 1,
         'bat': 4,
         'arnold': 1,
         'seahawks': 85,
         'jamal': 1,
         'adams': 1,
         'expected': 5,
         'participate': 1,
         'mandatory': 1,
         'minicamp': 1,
         'looks': 15,
         'sign': 15,
         'contract': 5,
   

## Word Shift Graphs — Proportion Shift

Nice! Now that we have our two dictionaries, we can use `Shifterator`.

## Install and Import Shifterator

In [None]:
!pip install shifterator

In [30]:
import shifterator as sh

import warnings
warnings.filterwarnings("ignore")

> The easiest word shift graph that we can construct is a proportion shift. If $p_i^{(1)}$ is the relative frequency of word $i$ in the first text, and $p_i^{(2)}$ is its relative frequency in the second text, then the proportion shift calculates their difference:
>
> $$\delta p_i = p_i^{(2)} - p_i^{(1)}$$
>
> If the difference is positive ($\delta p_i > 0$), then the word is relatively more common in the second text. If it is negative ($\delta p_i < 0$), then it is relatively more common in the first text. We can rank words by this difference and plot them as a word shift graph.
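
To make the formula concrete, here is a minimal sketch that computes the proportion shift by hand with `Counter`. The toy counts and the helper `relative_freqs` are hypothetical and are not taken from the Reddit data. This only illustrates the arithmetic; Shifterator's own implementation may differ in details.

```python
from collections import Counter

# Hypothetical toy counts, not taken from the Reddit data
counts_1 = Counter({'rain': 6, 'coffee': 3, 'ferry': 1})
counts_2 = Counter({'rain': 2, 'coffee': 3, 'bridge': 5})

def relative_freqs(counts):
    # Divide each count by the total number of words in the corpus
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

p1 = relative_freqs(counts_1)
p2 = relative_freqs(counts_2)

# A word missing from one corpus has relative frequency 0 there
vocabulary = set(p1) | set(p2)
delta_p = {word: p2.get(word, 0) - p1.get(word, 0) for word in vocabulary}

# Print from most negative (more common in text 1) to most positive (text 2)
for word, delta in sorted(delta_p.items(), key=lambda item: item[1]):
    print(f"{word:>8} {delta:+.2f}")
```

In this toy example, "rain" gets a negative score because it makes up a larger share of the first corpus, while "bridge" gets a positive score because it appears only in the second.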

## 🛑  Pause!
Before you run this cell, take a few minutes and make predictions about what words you think are going to be the most relatively frequent in Seattle Reddit posts and Portland Reddit posts. Jot down a few thoughts.

In [None]:
# Predictions and thoughts

## 🚦 Go!

In [None]:
proportion_shift = sh.ProportionShift(type2freq_1 = #Dictionary 1,
                                        type2freq_2= #Dictionary 2)

proportion_shift.get_shift_graph(system_names = ['Seattle', 'Portland'],  width=15, height=15, cumulative_inset=False,  top_n = 50)

Check to see where certain words show up in the Seattle Reddit data:

In [None]:
word_filter = seattle_df['title'].str.contains('Insert a word here!!', case=False)
seattle_df[#Your code here]

Check to see where certain words show up in the PDX Reddit data:

In [None]:
word_filter = pdx_df['title'].str.contains('Insert a word here!!', case=False)
pdx_df[#Your code here]

### Top 50 Words with Highest Relative Frequency for Portland

To sort the words by score, we can use `.get_shift_scores()` with a Counter. To select just the top 50, we need to use list slicing.

In [None]:
Counter(proportion_shift.get_shift_scores()).most_common()#Your code here -- slice the top 50

### Top 50 Words with Highest Relative Frequency for Seattle

To sort the words by score, we can use `.get_shift_scores()` with a Counter. To select just the bottom 50, we need to use list slicing.

In [None]:
Counter(proportion_shift.get_shift_scores()).most_common()#Your code here -- slice the bottom 50
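
As a rough illustration of the slicing pattern, the sketch below ranks a small dictionary of made-up scores. `demo_scores` is hypothetical and stands in for the dictionary that `get_shift_scores()` returns in the cells above.

```python
from collections import Counter

# Hypothetical shift scores standing in for the result of get_shift_scores()
demo_scores = {'bridge': 0.5, 'coffee': 0.0, 'ferry': -0.1, 'rain': -0.4, 'hike': 0.2}

# most_common() with no argument returns every (word, score) pair, highest score first
ranked = Counter(demo_scores).most_common()

# First two entries: the highest scores (relatively more common in the second text)
top_2 = ranked[:2]

# Last two entries: the lowest scores (relatively more common in the first text)
bottom_2 = ranked[-2:]

# Same two entries, reordered so the most negative score comes first
bottom_2_most_negative_first = ranked[:-3:-1]

print(top_2)
print(bottom_2)
print(bottom_2_most_negative_first)
```

Replacing `2` with `50` (and `-3` with `-51`) gives the top and bottom 50 for the real scores.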

## 🛑  Interpretation & Analysis

What kind of claims could you make about the Seattle vs. Portland Reddit posts based on the word shift graph of relative word frequency? Try to formulate it in one or two sentences.

In [None]:
# Your claims here

## Save a Word Shift Graph

To save a word shift graph, we simply need to pass a `filename` argument to `get_shift_graph()`.

In [None]:
proportion_shift = sh.ProportionShift(type2freq_1 = seattle_counts,
                                        type2freq_2= pdx_counts)

proportion_shift.get_shift_graph(system_names = ['Seattle', 'Portland'],  width=15, height=15, cumulative_inset=False,  top_n = 50, filename='SeattlevPDX.png')

## Word Shift Graph — Entropy

In [None]:
entropy_shift = sh.EntropyShift(type2freq_1 = #Dictionary 1,
                                type2freq_2= #Dictionary 2,
                                base=2)

entropy_shift.get_shift_graph(system_names = ['Seattle', 'Portland'],  width=15, height=15, cumulative_inset=False,  top_n = 50)

## Word Shift Graph — Sentiment Analysis

We can also use a sentiment lexicon to see which corpus is more or less positive, and we can see which words contribute to that score. Here we will use the labMT Sentiment Lexicon, where 1 is the least happy and 9 is the most happy.

In the graph below, the `type2freq_2` dictionary will be the "reference" corpus.

In [None]:
sentiment_shift = sh.WeightedAvgShift(type2freq_1 = #Dictionary 1,
                                type2freq_2= #Dictionary 2,
                                type2score_1='labMT_English',
                                stop_lens=[(4,6)]
                                )

sentiment_shift.get_shift_graph(system_names = ['Seattle', 'Portland'],  width=15, height=15, cumulative_inset=False, top_n =50)
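
To give a feel for what a lexicon-based score measures, here is a minimal sketch of a count-weighted average sentiment computed by hand. The scores in `demo_scores` are made up for illustration and are NOT real labMT values. The helper `average_sentiment` is hypothetical, and treating the `stop_lens` boundaries as inclusive is an assumption. Shifterator's own computation may differ.

```python
from collections import Counter

# Hypothetical happiness scores on a 1-9 scale; NOT real labMT values
demo_scores = {'love': 8.4, 'rain': 5.0, 'traffic': 3.0, 'market': 5.5, 'fight': 2.5}

def average_sentiment(counts, scores, stop_lens=((4, 6),)):
    total_count = 0
    total_score = 0.0
    for word, n in counts.items():
        # Words that are missing from the lexicon do not contribute
        if word not in scores:
            continue
        score = scores[word]
        # Skip fairly neutral words whose score falls inside a stop lens
        # (inclusive boundaries are an assumption of this sketch)
        if any(low <= score <= high for low, high in stop_lens):
            continue
        total_count += n
        total_score += n * score
    # Return None if no scored words remain after filtering
    return total_score / total_count if total_count else None

# Hypothetical toy corpora
corpus_a = Counter({'love': 3, 'rain': 10, 'traffic': 1})
corpus_b = Counter({'love': 1, 'market': 4, 'fight': 3})

print(average_sentiment(corpus_a, demo_scores))
print(average_sentiment(corpus_b, demo_scores))
```

Here "rain" and "market" fall inside the (4, 6) range and are ignored, so the frequent but neutral words do not pull both averages toward the middle of the scale.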

## Top 50 Most Positive or Negative Words

In [None]:
Counter(sentiment_shift.get_shift_scores()).most_common()#Your code here

## 🛑  Interpretation & Analysis

What kind of claims, if any, could you make about the Seattle vs. Portland Reddit posts based on the word shift graph of sentiment difference? Try to formulate it in one or two sentences.

In [None]:
# Your claims here

## Compare Two Different Cities (Or Two Corpora)

In [34]:
ls ../reddit-data/

Ask-Men-Reddit-Submissions-1k.csv    LA-Reddit-Submissions-1k.csv
Ask-Women-Reddit-Submissions-1k.csv  Portland-Reddit-Submissions-1k.csv
Beijing-Reddit-Submissions-1k.csv    Rogue-One-Reddit-Submissions-1k.csv
Chicago-Reddit-Submissions-1k.csv    Seattle-Reddit-Submissions-1k.csv
Dune-Reddit-Submissions-1k.csv


In [15]:
city1_df = pd.read_csv('../reddit-data/Seattle-Reddit-Submissions-1k.csv', parse_dates=['date'])
city1_posts = city1_df['title'].to_list()

city1_counts = Counter()

for post in city1_posts:
    words = tokenize_words(post)
    city1_counts.update(words)

In [49]:
city2_df = pd.read_csv('../reddit-data/Portland-Reddit-Submissions-1k.csv', parse_dates=['date'])
city2_posts = city2_df['title'].to_list()

city2_counts = Counter()

for post in city2_posts:
    words = tokenize_words(post)
    city2_counts.update(words)

In [None]:
proportion_shift = sh.ProportionShift(type2freq_1 = #Dictionary 1,
                                        type2freq_2= #Dictionary 2)

proportion_shift.get_shift_graph(system_names = ['City1', 'City2'],  width=15, height=15, cumulative_inset=False,  top_n = 50)