<br><br><font color="gray">INTEG 440 / 640<br>MODULE 8 of *Doing Computational Social Science*</font>


# <font color="green" size=40>PROCESSING NATURAL <br>LANGUAGE DATA</font>

<br>

Dr. [John McLevey](http://www.johnmclevey.com)    
Department of Knowledge Integration   
Department of Sociology & Legal Studies     
University of Waterloo         

<hr>

* INTEG 440 (Undergraduate): This module is worth <font color='#437AB2'>**8%**</font> of your final grade. The questions in this module add up to 10 points. 
* INTEG 640 (Graduate): This module is worth <font color='#437AB2'>**5%**</font> of your final grade. The questions in this module add up to 10 points. 

<hr>

# Table of Contents 

* [Overview](#o)
* [Learning Outcomes](#lo) 
* [Prerequisite Knowledge](#pk) 
* [Assigned Readings](#ar) 
* [Question Links](#ql)
* [Packages Used in this Module](#packs)
* [Data Used in this Module](#data)
* [**Content Analysis and Computation**](#cac)
* [**Fundamentals of Natural Language Processing**](#fundamentals)
* [**The `spacy` Pipeline and `Doc` Object**](#spacy)
* [References](#refs)

<hr>   

# Overview <a id='o'></a>

In this module, you will learn to use the package `spaCy` for fast and accurate natural language processing and for exploratory text analysis. We will cover fundamental processing tasks such as tokenization, removing stopwords, normalizing text by lemmatization, tagging words by their part-of-speech (e.g. nouns, noun chunks, verbs, adjectives), and extracting information about named entities (e.g. people, places, organizations). 

<hr>

# Learning Outcomes  <a id='lo'></a>

Upon successful completion of this module, you will be able to: 

1. Explain how natural language processing and computational text analysis fit into established traditions of content analysis in the social sciences
2. Provide a high-level overview of the fundamental design differences between the most widely-used packages for natural language processing  
3. Explain the role that pre-processing plays in text analysis    
4. Pre-process text by:  
    4.1 removing stopwords   
    4.2 selecting parts-of-speech (e.g. nouns) to include in your analysis   
    4.3 detecting $n$-grams   
    4.4 normalizing text by lemmatization   
    4.5 extracting information about named entities (people, places, organizations, etc.)   

<hr>

# Prerequisite Knowledge  <a id='pk'></a>

This module requires a basic level of comfort working with strings, lists, matrices, and `Pandas` dataframes. 

<hr>

# Assigned Readings  <a id='ar'></a>

This module assumes you have completed the assigned readings, which are listed immediately below. The readings provide a detailed explanation of the core concepts covered in this module. 

* <font color="green">Chapter 14 "Content Analysis and Computation" from *Doing Computational Social Science*.</font> 
* <font color="green">Chapter 15 "Natural Language Processing" from *Doing Computational Social Science*.</font> 

As always, I recommend that you (1) complete the assigned readings, (2) attempt to complete this module without consulting the readings, making notes to indicate where you are uncertain, (3) go back to the readings to fill in the gaps in your knowledge, and finally (4) attempt to complete the parts of this module that you were unable to complete the first time around.

This module notebook includes highly condensed overviews of *some* of the key material from the assigned reading. This is intended as a *supplement* to the assigned reading, *not as a replacement for it*. These high-level summaries do not contain enough information for you to successfully complete the exercises that are part of this module, and they do not cover every relevant topic. 

<hr>

# Question Links <a id='ql'></a>

Make sure you have answered all of the following questions before submitting this notebook on LEARN. 

1. [Question 1](#yt1) 
2. [Question 2](#yt2) 
3. [Question 3](#yt3) 
4. [Question 4](#yt4) 
5. [Question 5](#yt5) 
6. [Question 6](#yt6) 
7. [Question 7](#yt7) 
8. [Question 8](#yt8) 

<hr>

# Packages Used in this Module  <a id='packs'></a>

The cell below imports the packages that are necessary to complete this module. If there are any additional packages you wish to import, you may add them to this import cell. 

In [2]:
!pip install gensim
import pandas as pd
import spacy 
nlp = spacy.load('en_core_web_sm')

from gensim.models.phrases import Phrases, Phraser


# Data Used in this Module <a id='data'></a>

Most of this module (and the three that follow) will use Megan Risdale's (2016) dataset on fake and real news ([obtained from Kaggle](https://www.kaggle.com/mrisdal/fake-news)). We will load it from our data subdirectory.

In [3]:
df = pd.read_csv('data/fake_news.csv')
df.sample(15)

Unnamed: 0.1,Unnamed: 0,title,text,label
5554,7618,JUST IN: Republicans Sued Over Trump’s Call To...,\nThe Democratic National Committee just hau...,FAKE
5487,3531,Iowa Christians struggle to square faith with ...,"Des Moines, Iowa (CNN) As Christians, they wan...",REAL
5540,6759,Fukushima – The Untouchable Eco-Apocalypse No ...,Waking Times – by Alex Pietrowski \nThe most i...,FAKE
4752,35,Arizona first in nation to require patients be...,Arizona will become the first state in the nat...,REAL
4475,3333,Clinton Foundation received subpoena from Stat...,Investigators with the State Department issued...,REAL
281,3592,"In France, a growing debate over why some spee...",They tease terrorists. The prophet Muhammad cr...,REAL
3039,8924,"Without Bold Agenda, Warn Progressives, A Clin...","Without Bold Agenda, Warn Progressives, A Clin...",FAKE
3257,4700,What WikiLeaks hack says about Clinton: Our view,Now we know why she didn't want those Wall Str...,REAL
97,3255,Give Social Security recipients a CEO-style raise,(CNN) On Veterans Day we recognize and honor t...,REAL
1189,2092,"On climate change, ideological and partisan po...","Later this week, Pope Francis will reportedly ...",REAL


# Content Analysis and Computation <a id='cac'></a>

Chapter 14 ('Content Analysis and Computation') from *Doing Computational Social Science* provides a high-level overview of the goals of content analysis in the social sciences, and the challenges and opportunities associated with digital text data and new computational methods. Before getting into specific methods, it is important that you understand the differences between two fundamentally different types of computational text analysis (supervised and unsupervised), and how they can be combined with careful human reading and interpretation. 

### <font color="green">YOUR TURN! (Question 1)</font> <a id='yt1'></a>

Question is Worth: <font color="green">1.25 points</font>

In the cell block below, compare the uses of and workflows for supervised and unsupervised learning. Explain how both relate to longstanding traditions of content analysis in the social sciences. 

Supervised learning helps researchers scale up content analysis that has already been operationalized. It lets them evaluate theory while keeping considerable control over what the machine "learns." When the categories of interest are known in advance, a model can be trained on labelled examples and then used to classify new data. Because the classifications can be checked against a ground truth, there is more assurance that the analysis is sound. Essentially, we control much of the process. 

Unsupervised learning uncovers patterns in data that have not previously been labelled or assessed. Without labels, this approach works inductively to surface new patterns or latent meaning in the data. Discovering new theory and patterns in this way is valuable to the social sciences as long as the results are interpreted correctly. Compared to supervised learning, it requires more exploratory digging and careful theoretical grounding, but it can reveal interesting new patterns.

### <font color="green">YOUR TURN! (Question 2)</font> <a id='yt2'></a>

Question is Worth: <font color="green">1.25 points</font>

Methodologists have recently been writing about combining unsupervised learning, supervised learning, and 'guided' human reading and interpretation. In the cell block below, explain (1) why it is essential to preserve the role of human interpretation in computational text analysis, and (2) why it can be useful to combine unsupervised and supervised learning in formal frameworks such as Laura Nelson's (2017) '[Computational Grounded Theory](https://journals.sagepub.com/doi/abs/10.1177/0049124117729703)' (which was discussed in the assigned reading). 

Why is it essential to preserve the role of human interpretation in computational text analysis?

Machines and humans are inherently different and can work in harmony if used in the right way. The contextual understanding and interpretation of the intricacies of language that humans possess are unmatched by any machine learning algorithm. Computers perform mathematical operations on data rather than understanding its meaning and linguistic nuance. Computers are well suited to operations at scale, while humans are well suited to serving as the ground truth of reference. 

Why can it be useful to combine unsupervised and supervised learning in formal frameworks?

Unsupervised learning methods can help to discover patterns and concepts that can act as ground truth for the data once validated by researchers. In this form, the up-front process of labelling data and developing a ground truth to structure it is effectively removed. Combining this with guided deep reading and supervised methods helps prevent outliers from introducing bias and makes it easier to interpret the data. 


# Fundamentals of Natural Language Processing <a id='fundamentals'></a>

## Getting to Know `spacy` 

### <font color="green">YOUR TURN! (Question 3)</font> <a id='yt3'></a>

Question is Worth: <font color="green">.5 points</font>

As discussed in the assigned reading, it is important to understand the design philosophy behind `spaCy` and how it differs from alternatives, including the more established package `nltk`. Before we start getting into how `spaCy` works, take a few moments to summarize the design philosophy for `spaCy` in the text cell below. Be sure to compare this design philosophy with the one guiding `nltk`.

The creator of `spaCy` has noted that `nltk` contains inefficient algorithms that were originally designed for teaching NLP. `nltk` is therefore a great reference for beginners learning NLP, but it does not provide the speed and performance needed for practical computation. `spaCy`, by contrast, imposes many constraints in order to deliver higher-performance processing for production-level NLP rather than research-oriented exploration. This helps some scientists make informed decisions, but leaves gaps for those who need to go outside those boundaries and explore. The version of `spaCy` can also affect the analysis, since packages are constantly updated. As long as you consider how the research will be done and are mindful of the constraints in the package's design, you can adjust and train the models for your data as needed. `spaCy` also treats data quality as paramount to NLP performance, and provides tools for careful, meticulous data annotation.

# The `spacy` Pipeline and `Doc` Object <a id='spacy'></a>

The following abstract is from Muller, Sampson, and Winter's (2018) article "[Environmental Inequality: The Social Causes and Consequences of Lead Exposure](https://www.annualreviews.org/doi/10.1146/annurev-soc-073117-041222)," published in the *Annual Review of Sociology*. Let's use this example to illustrate some basic natural language processing (NLP) concepts and tasks before moving on to a bigger example.

> In this article, we review evidence from the social and medical sciences on the causes and effects of lead exposure. We argue that lead exposure is an important subject for sociological analysis because it is socially stratified and has important social consequences -- consequences that themselves depend in part on children's social environments. We present a model of environmental inequality over the life course to guide an agenda for future research. We conclude with a call for deeper exchange between urban sociology, environmental sociology, and public health, and for more collaboration between scholars and local communities in the pursuit of independent science for the common good.

### <font color="green">YOUR TURN! (Question 4)</font> <a id='yt4'></a>

Question is Worth: <font color="green">.5 points</font>

When you read this abstract, you know where words begin and end, and where sentences begin and end. It is more challenging for a computer to be able to tell where these "tokens" begin and end. Why? What are some common challenges for a computer "reading" text data? 

It is hard for a computer to tell where tokens begin and end because context and grammatical rules change what a given character means. A period may end a sentence or belong to an acronym, contractions may need to be split or kept together, and emoji can be hard to interpret. Context- and language-specific tokenization rules, along with model-based approaches, make this easier. Additionally, understanding parts of speech is important for machines to learn accurately. 
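
To make these challenges concrete, here is a minimal sketch that uses only Python's standard library. It is not how `spaCy` tokenizes text; the example sentence and the regular expression are illustrative assumptions, chosen to show where simple rules break down.

```python
import re

# Hypothetical example sentence (not from the module's data).
sentence = "Dr. Smith doesn't work in the U.S. anymore."

# Naive word tokenization: split on whitespace.
# Punctuation stays glued to the neighbouring word (e.g. 'anymore.').
naive_tokens = sentence.split()

# Naive sentence segmentation: split on periods.
# The abbreviations 'Dr.' and 'U.S.' are wrongly treated as sentence ends.
naive_sentences = [s.strip() for s in sentence.split(".") if s.strip()]

# A slightly better (still simplistic) pattern: keep runs of word characters
# and apostrophes together, and make every other non-space character a token.
regex_tokens = re.findall(r"[\w']+|[^\w\s]", sentence)

print(naive_tokens)
print(naive_sentences)
print(regex_tokens)
```

The regular expression keeps "doesn't" together and separates the final period, but it also breaks "U.S." into four tokens. Cases like this are why practical tokenizers rely on language-specific rules and exceptions rather than a single pattern.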

Pre-processing tasks like selecting words based on their part-of-speech always degrade the reading experience for a human because they involve stripping out information that makes it easier for humans to understand the meaning of any individual text. But when we want a computer to tell us something about the content of many texts in a document collection, the same pre-processing tasks are *essential* for producing informative results. Effective pre-processing is all about knowing what kinds of information need to be preserved or removed to improve the ability of the computer to "read" many texts. That may or may not involve tasks like selecting nouns and verbs, but it will almost always include tasks like removing stopwords, removing punctuation, and normalizing text. 

Let's begin by walking through some basic pre-processing on the abstract introduced above. The first thing we will do is create a string object containing the abstract, and then we will feed it into the `spaCy` pipeline `nlp()`, which we defined right under `import spacy` when we were importing the packages used in this lesson. 

In [5]:
ab = "In this article, we review evidence from the social and medical sciences on the causes and effects of lead exposure. We argue that lead exposure is an important subject for sociological analysis because it is socially stratified and has important social consequences -- consequences that themselves depend in part on children's social environments. We present a model of environmental inequality over the life course to guide an agenda for future research. We conclude with a call for deeper exchange between urban sociology, environmental sociology, and public health, and for more collaboration between scholars and local communities in the pursuit of independent science for the common good."

proc = nlp(ab)

When we process text by running it through the `spaCy` pipeline, `spaCy` stores the information we want in a `Doc` object. If we want to see the tokenized sentences, for example, we can iterate over the sentences in the `Doc` object and print them to screen. In this example, our `Doc` object is stored in the variable `proc`.

In [6]:
for sent in proc.sents:
    print(sent)
    print('\n')

In this article, we review evidence from the social and medical sciences on the causes and effects of lead exposure.


We argue that lead exposure is an important subject for sociological analysis because it is socially stratified and has important social consequences -- consequences that themselves depend in part on children's social environments.


We present a model of environmental inequality over the life course to guide an agenda for future research.


We conclude with a call for deeper exchange between urban sociology, environmental sociology, and public health, and for more collaboration between scholars and local communities in the pursuit of independent science for the common good.




We can iterate over tokens (e.g. sentences, words) for a variety of important text processing tasks, including normalizing text, removing stopwords, identifying parts-of-speech, and extracting named entities. For example, we can use normalized words rather than the original words by iterating over the words in the abstract and adding each word's lemma to a list. This time we will use a list comprehension to iterate over the tokens. 

In [7]:
lemmas = [token.lemma_ for token in proc]
print(lemmas)

['in', 'this', 'article', ',', '-PRON-', 'review', 'evidence', 'from', 'the', 'social', 'and', 'medical', 'science', 'on', 'the', 'cause', 'and', 'effect', 'of', 'lead', 'exposure', '.', '-PRON-', 'argue', 'that', 'lead', 'exposure', 'be', 'an', 'important', 'subject', 'for', 'sociological', 'analysis', 'because', '-PRON-', 'be', 'socially', 'stratified', 'and', 'have', 'important', 'social', 'consequence', '--', 'consequence', 'that', '-PRON-', 'depend', 'in', 'part', 'on', 'child', "'s", 'social', 'environment', '.', '-PRON-', 'present', 'a', 'model', 'of', 'environmental', 'inequality', 'over', 'the', 'life', 'course', 'to', 'guide', 'an', 'agenda', 'for', 'future', 'research', '.', '-PRON-', 'conclude', 'with', 'a', 'call', 'for', 'deep', 'exchange', 'between', 'urban', 'sociology', ',', 'environmental', 'sociology', ',', 'and', 'public', 'health', ',', 'and', 'for', 'more', 'collaboration', 'between', 'scholar', 'and', 'local', 'community', 'in', 'the', 'pursuit', 'of', 'independe

Similarly, we can iterate over the words in the abstract and check to see if the word is a stopword. If not, we can add it to a new list. 

In [8]:
wo_stops = [token for token in proc if not token.is_stop]
print(wo_stops)

[article, ,, review, evidence, social, medical, sciences, causes, effects, lead, exposure, ., argue, lead, exposure, important, subject, sociological, analysis, socially, stratified, important, social, consequences, --, consequences, depend, children, social, environments, ., present, model, environmental, inequality, life, course, guide, agenda, future, research, ., conclude, deeper, exchange, urban, sociology, ,, environmental, sociology, ,, public, health, ,, collaboration, scholars, local, communities, pursuit, independent, science, common, good, .]
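
Notice that the list above still contains punctuation tokens such as `,` and `.`. The sketch below shows one way to drop both stopwords and punctuation in a single pass. So that it runs without loading a model, it uses hypothetical stand-in objects that mimic three `spaCy` token attributes (`text`, `is_stop`, `is_punct`); the stand-ins and their values are assumptions for illustration only.

```python
from types import SimpleNamespace


def make_token(text, is_stop=False, is_punct=False):
    """Build a hypothetical stand-in for a spaCy Token (illustration only)."""
    return SimpleNamespace(text=text, is_stop=is_stop, is_punct=is_punct)


# Stand-in for the first few tokens of the abstract.
tokens = [
    make_token("In", is_stop=True),
    make_token("this", is_stop=True),
    make_token("article"),
    make_token(",", is_punct=True),
    make_token("we", is_stop=True),
    make_token("review"),
    make_token("evidence"),
    make_token(".", is_punct=True),
]

# Keep a token only if it is neither a stopword nor punctuation.
content_words = [t.text for t in tokens if not t.is_stop and not t.is_punct]
print(content_words)
```

With a real `Doc` object, the same comprehension would read `[t.text for t in proc if not t.is_stop and not t.is_punct]`.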


Extracting words by their part-of-speech is no different. First, let's print the part-of-speech for each word in the abstract. Then let's make a list that includes only the nouns. 

In [9]:
for item in proc:
    print(item.text + '({})'.format(item.pos_))

In(ADP)
this(DET)
article(NOUN)
,(PUNCT)
we(PRON)
review(VERB)
evidence(NOUN)
from(ADP)
the(DET)
social(ADJ)
and(CCONJ)
medical(ADJ)
sciences(NOUN)
on(ADP)
the(DET)
causes(NOUN)
and(CCONJ)
effects(NOUN)
of(ADP)
lead(NOUN)
exposure(NOUN)
.(PUNCT)
We(PRON)
argue(VERB)
that(DET)
lead(NOUN)
exposure(NOUN)
is(AUX)
an(DET)
important(ADJ)
subject(NOUN)
for(ADP)
sociological(ADJ)
analysis(NOUN)
because(SCONJ)
it(PRON)
is(AUX)
socially(ADV)
stratified(ADJ)
and(CCONJ)
has(AUX)
important(ADJ)
social(ADJ)
consequences(NOUN)
--(PUNCT)
consequences(NOUN)
that(SCONJ)
themselves(PRON)
depend(VERB)
in(ADP)
part(NOUN)
on(ADP)
children(NOUN)
's(PART)
social(ADJ)
environments(NOUN)
.(PUNCT)
We(PRON)
present(VERB)
a(DET)
model(NOUN)
of(ADP)
environmental(ADJ)
inequality(NOUN)
over(ADP)
the(DET)
life(NOUN)
course(NOUN)
to(PART)
guide(VERB)
an(DET)
agenda(NOUN)
for(ADP)
future(ADJ)
research(NOUN)
.(PUNCT)
We(PRON)
conclude(VERB)
with(ADP)
a(DET)
call(NOUN)
for(ADP)
deeper(ADJ)
exchange(NOUN)
between(ADP)
urb

### <font color="green">YOUR TURN! (Question 5)</font> <a id='yt5'></a>

Question is Worth: <font color="green">1 point</font>

In the cell below, use either a for loop or list comprehension to iterate over the tokens and add the word to a list of nouns if the word is a noun. 

In [10]:
## Your Answer Here ## 

### BEGIN SOLUTION 
nouns = [item.text for item in proc if item.pos_ == 'NOUN']
### END SOLUTION 

print(nouns)

['article', 'evidence', 'sciences', 'causes', 'effects', 'lead', 'exposure', 'lead', 'exposure', 'subject', 'analysis', 'consequences', 'consequences', 'part', 'children', 'environments', 'model', 'inequality', 'life', 'course', 'agenda', 'research', 'call', 'exchange', 'sociology', 'sociology', 'health', 'collaboration', 'scholars', 'communities', 'pursuit', 'science', 'good']


`spaCy` is also able to identify noun chunks, or "phrases." 

In [16]:
for item in proc.noun_chunks:
  print(item.text)

()

As you can see, `spaCy` really simplifies key tasks in natural language processing. Knowing how, when, and why to do these and other tasks is the secret to getting good results when you are doing automated text analysis. **If you don't pre-process your text, you will not get good results, no matter how sophisticated your models are.** 

How, then, can we combine these methods into a simple text pre-processing step? And how do we scale this up to a collection of abstracts (or any other text data) rather than a single string?

In the cells below, I have provided some code to pre-process text data from a sample of our fake news dataset. We imported that data into memory at the start of this notebook. 

The cell immediately below this one will take a bit of time to run because of all the work `spaCy` is doing to parse the text. Once it has finished, run the next code cell to actually pre-process the text. 

In [12]:
fake_sample = df[df['label'] == 'FAKE'].sample(200)
real_sample = df[df['label'] == 'REAL'].sample(200)
sampled_news = pd.concat([fake_sample, real_sample])
len(sampled_news)

400

In [13]:
text = sampled_news['text']

Remember the next step will take a while to run because `spaCy` is doing all of the computationally intensive work up front. 

In [14]:
processed = [nlp(t) for t in text]

Now let's write a bit of code to iterate over the list of texts that have been parsed by `spaCy`. Recall from the lesson on computational thinking that we generally want to develop solutions that can cover a range of related problems. In this case, we want to write code that can be used to process any text we provide it, not *just* the text we gave it in any single instance. Below, we will write a function to do this work. 

Our goal is to return two lists. Each story in our dataset will be represented by a string of lemmatized nouns and adjectives, and that string will be appended to a list covering every story in the sample. *In addition to our nouns and adjectives*, we will extract "named entities" of the type `PERSON` using `spaCy`'s named entity recognition models. The named entities for each story will also be returned as a list, giving us a list of lists. 

In [18]:
def prepare_text(list_of_processed_texts):
    """
    Quickly grab entities and lemmas of non-stopword nouns and adjectives. 
    """
    analysis_text = []
    named_entities = []
    
    types = ['NOUN', 'ADJ'] 
    for doc in list_of_processed_texts: 
        # Compare strings with ==, not `is` (identity), which only works by accident
        ents = [ent.text for ent in doc.ents if ent.label_ == 'PERSON' and ' ' in ent.text]
        reduced = [token.lemma_ for token in doc if not token.is_stop and token.pos_ in types]
        
        analysis_text.append(" ".join(reduced))
        named_entities.append(list(set(ents)))
    return analysis_text, named_entities

### <font color="green">YOUR TURN! (Question 6)</font>  <a id='yt6'></a>

Question is Worth: <font color="green">2 points</font>
    
In the text cell below, explain *in plain language* what each line of the function in the code block above is doing. Be sure to explain what each line takes in (i.e. the inputs), what it does to those inputs, and what it returns (i.e. the outputs). What exactly does the full function return? 

The function takes a list of texts that have already been processed by `nlp()`, i.e. a list of `Doc` objects. It first initializes two empty lists, one for the analysis text and one for the named entities. A `types` list then defines which parts of speech are relevant to the analysis; in this case, only nouns and adjectives. For each `Doc` object in the list passed to the function, the text of each named entity is saved to `ents` (on the condition that the recognition model labelled the entity `PERSON` and that the text contains a space, i.e. it is a multi-word name), and the lemmas (base forms) of the tokens that are non-stopword nouns or adjectives are saved to `reduced`. The lemmas are joined into a single string and appended to `analysis_text`, and the de-duplicated entities are appended to `named_entities`. The function returns both lists: a list of strings and a list of lists.
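
To see the shape of what comes back, the sketch below mirrors the logic of `prepare_text` on a tiny hand-built example. The `StandInDoc` class and the `tok` and `ent` helpers are hypothetical stand-ins for `spaCy` objects, and the tags and labels are assigned by hand rather than predicted by a model. The sketch also sorts the de-duplicated names so the output is deterministic, which the original function does not do.

```python
from types import SimpleNamespace


class StandInDoc:
    """Hypothetical stand-in for a spaCy Doc: iterable over tokens, with .ents."""

    def __init__(self, tokens, ents):
        self._tokens = tokens
        self.ents = ents

    def __iter__(self):
        return iter(self._tokens)


def tok(lemma, pos, is_stop=False):
    """Hypothetical stand-in for a spaCy Token."""
    return SimpleNamespace(lemma_=lemma, pos_=pos, is_stop=is_stop)


def ent(text, label):
    """Hypothetical stand-in for a spaCy entity span."""
    return SimpleNamespace(text=text, label_=label)


def prepare_text_sketch(docs, types=("NOUN", "ADJ")):
    analysis_text, named_entities = [], []
    for doc in docs:
        # Multi-word PERSON entities only (the text must contain a space).
        people = [e.text for e in doc.ents if e.label_ == "PERSON" and " " in e.text]
        # Lemmas of non-stopword nouns and adjectives.
        reduced = [t.lemma_ for t in doc if not t.is_stop and t.pos_ in types]
        analysis_text.append(" ".join(reduced))      # one string per document
        named_entities.append(sorted(set(people)))   # one list per document
    return analysis_text, named_entities


doc = StandInDoc(
    tokens=[
        tok("Hillary", "PROPN"), tok("Clinton", "PROPN"), tok("praise", "VERB"),
        tok("the", "DET", is_stop=True), tok("new", "ADJ"), tok("policy", "NOUN"),
    ],
    ents=[
        ent("Hillary Clinton", "PERSON"), ent("Clinton", "PERSON"),
        ent("Hillary Clinton", "PERSON"), ent("Washington", "GPE"),
    ],
)

texts, people = prepare_text_sketch([doc])
print(texts)
print(people)
```

The single-word entity "Clinton" and the `GPE` entity "Washington" are filtered out, and the repeated "Hillary Clinton" appears only once.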

Now let's use our function. 

In [19]:
prepped = prepare_text(processed)

analysis_text = prepped[0]
named_entities = prepped[1]

Let's inspect the result by looking at the content returned for the first 5 stories.

In [20]:
analysis_text[:5]

['home shocking look shocking late number anchor baby birth baby birthright citizen city size analysis datum datum newborn illegal percent birth analysis report birth unmarried foreign woman foreign illegal birthrate drop decline birth american woman report baby unauthorized immigrant parent percent birth percent birth foreign mother share new mother teenager high % foreign % pic.twitter.com/ydjpv2ngxh @pewresearch birth foreign mother unmarried woman peak percent time rate steady woman percent study birthrate woman rise birthrate immigrant mother growth annual birth immigrant mom pic.twitter.com/doznvxxvgt @pewresearch annual number baby recent year great recession significant drop birth nationwide trajectory past decade upward birth growth number baby immigrant woman immigrant woman birth threefold increase immigrant woman birth annual number birth woman percent time period people case culture look easy answer refugee flow amnesty illegal immigrant law contrast law order wall immigra

As you can see, the result is a list of strings, one per story, and the content of each is as expected.

Now, let's inspect some of the named entities `spaCy` identified. Remember that named entity recognition is much less accurate than part-of-speech tagging. 

In [21]:
named_entities[:15]

[['Donald Trump', 'Hillary Clinton'],
 ['Ed Klein',
  'John Podesta',
  'Completely Dark –',
  'James Comey',
  'Clinton Campaign',
  'Hillary Clinton',
  'Donald Trump'],
 ['Adam Yoshida',
  'Ed Timperlake',
  'Lucianne Goldberg',
  'Huma Abedin',
  'Hillary Clinton',
  'Carlos Danger',
  'Anthony Weiner'],
 ['Amir A.', 'Amir A', 'INDRA WARNES \n'],
 ['Click Here', 'Dave Hodges'],
 ['Wendy Kaufman'],
 ['Damon Smith',
  'Simon Eastwood',
  'pic.twitter.com/b0BekmI1aw \n— Mark White',
  'Newton Abbot'],
 [],
 ['Charlie Hebdo', 'Joe Walsh', 'Simon Maloy'],
 ['Mark Sykes', 'Arthur James Balfour', 'Walter Rothschild', 'Ramzy Baroud'],
 ['Anthony A Fabrikant',
  'George Bush',
  'Edward Snowden',
  'Barrack Obama',
  'Eddie L.'],
 ['Geert Wilders',
  'Ian Greenhalgh',
  'Naming Trump',
  'Al Hussein',
  'Nigel Farage'],
 ['Guest Click'],
 ['Amanda Froelich'],
 ['Vitamin E']]

### <font color="green">YOUR TURN! (Question 7)</font> <a id='yt7'></a>

Question is Worth: <font color="green">2 points</font> (1 for the code + 1 for the comparison)

In the code block below, produce a list of the 25 most frequently mentioned people for a subset of the fake news stories and the real news stories. Then compare them in the text cell below. 

In [38]:
# Your Answer Here 
freq_names = {}
for names_list in named_entities:
    for name in names_list:
        if name in freq_names:
            freq_names[name] = freq_names[name]+1
        else:
            freq_names[name] = 1

sorted_d = sorted((value, key) for (key,value) in freq_names.items())
sorted_d[::-1][:25]

[(109, 'Hillary Clinton'),
 (109, 'Donald Trump'),
 (41, 'Barack Obama'),
 (34, 'Ted Cruz'),
 (27, 'Bill Clinton'),
 (24, 'Bernie Sanders'),
 (20, 'Mitch McConnell'),
 (19, 'George W. Bush'),
 (18, 'Marco Rubio'),
 (18, 'John Kasich'),
 (18, 'Jeb Bush'),
 (15, 'Paul Ryan'),
 (14, 'Vladimir Putin'),
 (12, 'Rand Paul'),
 (11, 'Ronald Reagan'),
 (11, 'John Boehner'),
 (11, 'James Comey'),
 (10, 'Mike Pence'),
 (10, "Donald Trump's"),
 (10, 'Ben Carson'),
 (9, 'Chris Christie'),
 (8, 'Loretta Lynch'),
 (8, 'Lindsey Graham'),
 (8, 'Huma Abedin'),
 (8, 'Harry Reid')]

The most frequently mentioned people are, "surprisingly," those involved in (or closely tied to) American politics, with Hillary Clinton and Donald Trump topping the list. One entry was captured in its possessive form, "Donald Trump's". These figures are the most frequent in real and fake news alike. 
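
The count above pools fake and real stories together. One possible way to compare the two groups is sketched below with `collections.Counter`. The entity lists and labels are hypothetical stand-in data; in the notebook, the labels could plausibly be taken from `sampled_news['label']`, assuming the order of `named_entities` still matches the order of the sampled rows. The `clean_name` helper is also an assumption, added so that possessive variants such as "Donald Trump's" are counted with the plain name.

```python
from collections import Counter

# Hypothetical stand-in data: one list of PERSON entities per story,
# plus a parallel list of labels for the same stories.
entities_per_story = [
    ["Hillary Clinton", "Donald Trump"],
    ["Donald Trump's", "James Comey"],
    ["Hillary Clinton", "Barack Obama"],
    ["Donald Trump", "Barack Obama"],
]
labels = ["FAKE", "FAKE", "REAL", "REAL"]


def clean_name(name):
    """Strip whitespace and a trailing possessive so variants count together."""
    name = name.strip()
    for suffix in ("'s", "’s"):
        if name.endswith(suffix):
            name = name[: -len(suffix)]
    return name


# One counter per label; update() adds one count per name in each story.
counts = {"FAKE": Counter(), "REAL": Counter()}
for names, label in zip(entities_per_story, labels):
    counts[label].update(clean_name(n) for n in names)

top_fake = counts["FAKE"].most_common(25)
top_real = counts["REAL"].most_common(25)
print(top_fake)
print(top_real)
```

Because `prepare_text` de-duplicates names within each story, counts built this way measure the number of stories that mention a person, not the total number of mentions.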

### <font color="green">YOUR TURN! (Question 8)</font> <a id='yt8'></a>

Question is Worth: <font color="green">1.5 points</font>

In the code block below, write a loop or list comprehension to create a list of named entities that are 'Geopolitical Entities' (in `spaCy`, `GPE`). 

In [42]:
# Your Answer Here 
list_of_ne = []

for doc in processed:
    # Use == for string comparison, and a new name so the earlier `named_entities` is not overwritten
    gpes = [ent.text for ent in doc.ents if ent.label_ == 'GPE']
    for ent in gpes:
        if ent not in list_of_ne:
            list_of_ne.append(ent)
            
print(list_of_ne)

["New York City's", 'Trumpland', 'Yemen', 'Saudi Arabia', 'U.S.', 'Iraq', 'Syria', 'Honduras', 'Afghanistan', 'Israel', 'Iran', 'Russia', 'Ukraine', 'Algeria', 'Kuwait', 'United Arab Emirates', 'Oman', 'Qatar', 'US', 'Drudge', 'California', 'serfdom.', 'Gettysburg', 'Washington', 'China', 'America', 'Mexico', 'the United States', 'Bakersfield', 'Sacramento', 'D-Sacramento', 'New York', 'Virginia', 'Los Angeles', 'Little Rock', 'New York FBI', 'Brooklyn', 'Manhattan', 'Arkansas', 'Las Vegas', 'North Korea', 'Vise', 'Japan', 'South Korea', "North Korea's", 'Tokyo', 'NORTH KOREA', 'Seoul', 'Iowa', 'Aston', 'Fulham', 'France', 'Breitbart', 'Marais Stadium', 'Jerusalem', 'the Dome of the Rock Mosque', 'Old City', 'Palestine', 'UNESCO', 'Bookmark', 'Canaan', 'North Africa', 'Rome', 'CE', 'the Roman Empire', 'tapaDILDO', 'Egypt', 'tel Megiddo', 'Torah', 'OTHERS', 'East Jerusalem', 'Megiddo', 'TapaDILDO', 'a Roman Empire', 'Empire', 'Germany', 'WASHINGTON', 'ISRAEL', 'Lebanon', 'Libya', 'Somal
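
The output above contains several spellings of the same place (for example 'North Korea', "North Korea's", and 'NORTH KOREA'). The sketch below shows one possible way to collapse such variants while de-duplicating. The `normalize_place` helper is a hypothetical illustration, and the input strings are copied from the output above.

```python
# Raw strings copied from the output above.
raw_gpes = [
    "North Korea", "North Korea's", "NORTH KOREA",
    "U.S.", "US", "Washington", "WASHINGTON",
]


def normalize_place(name):
    """Return a comparison key: no trailing possessive, case-insensitive."""
    name = name.strip()
    if name.endswith("'s"):
        name = name[:-2]
    return name.casefold()


seen = set()        # normalized keys already encountered (fast membership test)
unique_gpes = []    # first-seen spelling for each distinct key
for name in raw_gpes:
    key = normalize_place(name)
    if key not in seen:
        seen.add(key)
        unique_gpes.append(name)

print(unique_gpes)
```

Note that 'U.S.' and 'US' remain separate entries. Merging aliases like these would require an explicit lookup table, and no amount of string normalization will fix entities that the model mislabelled as `GPE` in the first place.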

<hr>

# <font color="green">Do You See Something That Could be Better?</font>

I am committed to collecting student feedback to continuously improve this course for future students. I would like to invite you to help me make those improvements. 

As you worked on this module, did you notice anything that could be improved? For example, did you find a typo in the module notebook **or in the assigned reading**? Did you find the explanation of a particular concept or block of code confusing? Is there something that just isn’t clicking for you? 

If you have any feedback for the content in this module, please enter it into the text block below. I will review feedback each week and make a list of things that should be changed before the next offering. 

Please know that *nothing you say here, however critical, will impact how I evaluate your work in this course*. There is no risk that I will assign a lower grade to you if you provide critical feedback. In fact, if the feedback you provide is thoughtful and constructive, I will assign up to 3% bonus marks on your final course grade. 

Thanks for your help improving the course! 

# Your Feedback Here :-)

<hr>

# REFERENCES <a id='refs'></a>

* McLevey, John. 2020. *Doing Computational Social Science*. Sage. London, UK. 
* Muller, C., Sampson, R. J., and Winter, A. S. 2018. 'Environmental Inequality: The Social Causes and Consequences of Lead Exposure.' *Annual Review of Sociology* 44: 263-282.
* Nelson, Laura. 2017. 'Computational Grounded Theory: A Methodological Framework.' *Sociological Methods & Research*. 1:40. 
* Risdale, Megan. 2016. "Getting Real about Fake News. Text & metadata from fake & biased news sources around the web." Dataset available on Kaggle: https://www.kaggle.com/mrisdal/fake-news 