# Metaphor identification algorithms
## from Neuman et al. 2013

Main concern/motivation: there is only one person (me) on this project, and I doubt that annotation by a single annotator would be reliable or useful

### Definitions (from Krishnakumaran and Zhu 2007)

Type I metaphor: "a subject noun is associated with an object noun via a form of the copula verb ‘to be’, such as in the case of ‘God is a king’."

Type II metaphor: "the verb is the focus of the metaphorical use representing the act of a subject noun on an object noun, such as in the case of ‘The war absorbed his energy.’"

Type III metaphor: "involve an adjective-noun phrase such as ‘sweet girl’." 

### Detecting type II metaphors

Phrase = < N1, V, N2 > where V represents the (metaphorical) action of N1 on N2

1. Identify the 100 most concrete object nouns associated with the verb V in a corpus
2. Categorize the 100 nouns by using WordNet
3. Categorize the object noun N2

Layer 1:

4. If none of the object noun categories overlaps with one of the categories of the 100 nouns associated with the verb, then return METAPHORICAL

Layer 2:

5. Find the main category of the object noun using ConceptNet
6. If the main category is not included in the categories of the 100 nouns, then return METAPHORICAL; else return LITERAL
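
The two layers above can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: the three dictionaries are hand-written stand-ins for the corpus counts, WordNet, and ConceptNet, and the function name `classify_type_ii` is hypothetical (it is not from Neuman et al. 2013).

```python
# Toy stand-ins for the real resources (all entries are illustrative assumptions).
# Steps 1-2: categories of the most concrete object nouns seen with each verb.
VERB_OBJECT_CATEGORIES = {
    "absorb": {"substance", "liquid", "artifact"},
}
# Step 3: every category the object noun can belong to (WordNet stand-in).
NOUN_CATEGORIES = {
    "water": {"substance", "liquid"},
    "energy": {"attribute", "phenomenon"},
    "capital": {"possession", "artifact"},
}
# Step 5: the single main category of each noun (ConceptNet stand-in).
MAIN_CATEGORY = {
    "water": "liquid",
    "energy": "attribute",
    "capital": "possession",
}


def classify_type_ii(verb, object_noun):
    """Return 'METAPHORICAL' or 'LITERAL' for a <N1, V, N2> phrase."""
    verb_categories = VERB_OBJECT_CATEGORIES[verb]
    noun_categories = NOUN_CATEGORIES[object_noun]

    # Layer 1 (step 4): no overlap at all means metaphorical.
    if not (noun_categories & verb_categories):
        return "METAPHORICAL"

    # Layer 2 (steps 5-6): check only the noun's main category.
    if MAIN_CATEGORY[object_noun] not in verb_categories:
        return "METAPHORICAL"
    return "LITERAL"


for noun in ("energy", "water", "capital"):
    print(noun, classify_type_ii("absorb", noun))
```

In this toy data, "energy" is caught by layer 1, while "capital" passes layer 1 (one of its senses is an artifact) and is only caught by layer 2.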

The algorithm above depends on a concept called "mutual information" => have you come across this before?
Definition: "a quantity that measures the mutual dependence of two random variables"
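
To make the definition concrete, here is a minimal sketch that estimates mutual information from observed pairs. It assumes the two random variables are the verb and the object noun of a phrase; the function name `mutual_information` and the sample pairs are made up for illustration, and the exact scoring used in the paper is not restated here.

```python
import math
from collections import Counter


def mutual_information(pairs):
    """Mutual information (in bits) of X and Y, estimated from (x, y) pairs."""
    n = len(pairs)
    joint = Counter(pairs)
    x_counts = Counter(x for x, _ in pairs)
    y_counts = Counter(y for _, y in pairs)

    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        p_x = x_counts[x] / n
        p_y = y_counts[y] / n
        # Each observed pair contributes p(x, y) * log2(p(x, y) / (p(x) * p(y))).
        mi += p_xy * math.log2(p_xy / (p_x * p_y))
    return mi


# Hypothetical (verb, object noun) observations.
dependent = [("drink", "water"), ("drink", "water"),
             ("read", "book"), ("read", "book")]
independent = [("drink", "water"), ("drink", "book"),
               ("read", "water"), ("read", "book")]

print(mutual_information(dependent))    # 1.0: the verb fully determines the noun
print(mutual_information(independent))  # 0.0: the verb says nothing about the noun
```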

### Detecting type I metaphors
**Still reasoning this one out** => manually annotating the White House sample didn't yield many examples of type I metaphors

Phrase = < N1, N2 > where N1 is the subject noun ("God") and N2 is the object noun ("king")

1. Identify the categories of N1 and N2 >>> if they do not overlap, then return METAPHORICAL

Layer 1:

2. Find main categories of N1 and N2 using ConceptNet
3. If the 2 main categories are different, then return METAPHORICAL

Layer 2 (disambiguating instances where N1 and N2 are from the same category, e.g., "my CAT is a TIGER"):

4. Identify the 100 nouns associated with N1 and N2 separately
5. Categorize the 100 nouns associated with N1 and N2 separately
6. If none of the nouns' categories overlap with each other, then return METAPHORICAL; else return LITERAL
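
A toy sketch of the type I steps above, again under stated assumptions: the dictionaries are hand-written stand-ins for WordNet, ConceptNet, and the corpus-derived noun associations, and `classify_type_i` is a hypothetical name.

```python
# Step 1: categories of each noun (WordNet stand-in, illustrative values only).
NOUN_CATEGORIES = {
    "god": {"spiritual_being"},
    "king": {"person"},
    "cat": {"animal"},
    "tiger": {"animal"},
    "pet": {"animal"},
}
# Layer 1: main category of each noun (ConceptNet stand-in).
MAIN_CATEGORY = {
    "god": "spiritual_being",
    "king": "person",
    "cat": "animal",
    "tiger": "animal",
    "pet": "animal",
}
# Layer 2: categories of the nouns most associated with each noun (corpus stand-in).
ASSOCIATED_CATEGORIES = {
    "cat": {"household", "food"},
    "tiger": {"wilderness", "prey"},
    "pet": {"household", "owner"},
}


def classify_type_i(n1, n2):
    """Return 'METAPHORICAL' or 'LITERAL' for a <N1, N2> copula phrase."""
    # Step 1: the nouns' categories do not overlap at all.
    if not (NOUN_CATEGORIES[n1] & NOUN_CATEGORIES[n2]):
        return "METAPHORICAL"

    # Layer 1 (steps 2-3): the main categories differ.
    if MAIN_CATEGORY[n1] != MAIN_CATEGORY[n2]:
        return "METAPHORICAL"

    # Layer 2 (steps 4-6): same category, so compare the associated nouns.
    if not (ASSOCIATED_CATEGORIES[n1] & ASSOCIATED_CATEGORIES[n2]):
        return "METAPHORICAL"
    return "LITERAL"


print(classify_type_i("god", "king"))   # stopped at step 1
print(classify_type_i("cat", "tiger"))  # stopped at layer 2
print(classify_type_i("cat", "pet"))    # passes every check
```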

Workflow: pass texts and a source domain, then identify instances where there was potential metaphor usage
> Weakness: still need a person to build larger frames

Question: are there methods for identifying these larger frames (ML)? In your view, would it be possible?
> Do you have any suggestions for overcoming these challenges?

Another question: stemmer/lemmatizer/tokenizer included in NLTK >>> how come you don't use them?
Please give me some advice on best practices when it comes to corpus-based work

In [None]:
# Importing necessary libraries

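#Libraries for Webscraping
import requests
from bs4 import BeautifulSoup

#Data preparation
import random
from os import listdir
from os.path import isfile, join
import pandas as pd
from datetime import datetime
import numpy as np
import string
from nltk.corpus import stopwords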

#Visualization
from wordcloud import WordCloud
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (10,10)

In [None]:
#Importing previously written functions

import os
import sys
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)
    
from notebooks.functions import GetLinks, CookedSoup, CleanText

# Functions

In [None]:
#Creating function to get url links for scraping
def GetLinks(url):
    
    #Making broth
    html_text = requests.get(url)
    broth = BeautifulSoup(html_text.text, 'lxml')
    urlset = broth.find_all('a', class_='news-item__title')
    
    #Collecting the href of every matching anchor tag
    links = [anchor.get('href') for anchor in urlset]
        
    return links

#Creating function to clean collected text data
class CleanText:
    def __init__(self, text = ''):
        self.text = text
        
        #Cleaning added text
        # obvious-ly => obvious ly (lemmatization)
        #Turn em/en dashes into spaces first, so two words joined by a dash are not fused
        words = text.replace('\u2014', ' ').replace('\u2013', ' ')
        #Drop the remaining non-ASCII characters (curly quotes, soft hyphens, ...)
        words = ''.join(filter(lambda x: x in string.printable, words))
        table = str.maketrans('', '', string.punctuation)
        words = [w.translate(table).lower() for w in words.split()]
        stop_words = set(stopwords.words('english'))
        words = list(filter(lambda w: w not in stop_words, words))
        words = list(filter(lambda w: w.isalpha(), words))
        
        self.words = words
        
#Creating function => pass list of links, give dict of scraped data

def CookedSoup(html_links):
    
    titles = []
    time = []
    btype = []
    original_text = []
    cleaned_words = []
    
    counter = np.arange(0,len(html_links))

    #Making soup of link
    for i in counter:
        html = requests.get(html_links[i])
        soup = BeautifulSoup(html.text, 'lxml')
            
    #Getting Titles
        title_html = soup.find('h1', class_='page-title topper__title news')
        
        titles.append(title_html.text.strip('\n').strip('\t').replace('\xa0', ' '))
                
    #Getting meta-data
        time_html = soup.find('time', class_='posted-on entry-date published updated')
        time_str = time_html.text
        datetime_object = datetime.strptime(time_str, '%B %d, %Y')
        time.append(datetime_object)
        
        btype_html = soup.find('a', rel= 'category tag')
        btype.append(btype_html.text)
        
    #Getting text content
        #Collect the text of every body-content section on the page
        briefing_text = []
        
        for para in soup.find_all('section', class_ = 'body-content'):
            briefing_text.append(para.text.strip('\n').strip('\t').
                                 replace('\xa0', ' ').replace('\n', ''))
        
        #Store the text as one string so it can be written to a file later
        full_text = ' '.join(briefing_text)
        original_text.append(full_text)
                        
    #Cleaning text content
        
        cleaned_text = CleanText(full_text)
        cleaned_words.append(cleaned_text.words)
        
    #Creating dict object
    
    briefing_info = {'title': titles, 'url':html_links, 'time': time, 'btype': btype, 
                     'original_text': original_text, 'cleaned_words': cleaned_words}

    return briefing_info

# Scraping website

In [None]:
#Preparing to scrape entire website

#Making soup of initial website
url = "https://www.whitehouse.gov/briefing-room/press-briefings/"

html_text = requests.get(url)
soup = BeautifulSoup(html_text.text, 'lxml')

#Creating indices
max_page = int(soup.find_all('a', class_='page-numbers')[-1].text.replace('Page ', ''))+1

pages_list = np.arange(2, max_page)
#type(pages_list)

#Creating page url list
blank_url = 'https://www.whitehouse.gov/briefing-room/press-briefings/{page_number}'
page_number = 'page/{number}/'
page1_url = 'https://www.whitehouse.gov/briefing-room/press-briefings/'

urls = []

for page in pages_list:
    page_url = blank_url.format(page_number = page_number.format(number = str(page)))
    
    urls.append(page_url)
    
urls.insert(0, page1_url)

#print(urls)

#Creating WH_briefing_df
column_list = ['title', 'time', 'btype','original_text', 'cleaned_words']

WH_briefing_df = pd.DataFrame(columns = column_list)

#Filling WH_briefing_df

#DataFrame.append was removed in pandas 2.0, so collect one DataFrame
#per page and concatenate them all at the end
page_dfs = []

for page_url in urls:
    
    links = GetLinks(page_url)
    page_dict = CookedSoup(links)
    page_dfs.append(pd.DataFrame(data = page_dict))

WH_briefing_df = pd.concat(page_dfs)

press_briefings = WH_briefing_df.reset_index(drop = True)
press_briefings

# Creating CSV of scraped results
## Do not run w/o modifying document name!

from pathlib import Path  
filepath = Path('/Users/daisyliu/Desktop/Research/UTEA_covid19_metaphors/notebooks/wordclouds/data/press_briefings.csv')  
filepath.parent.mkdir(parents=True, exist_ok=True)  
press_briefings.to_csv(filepath)

In [None]:
#Visualization

def WordCloudCreator(df):
    
    #Looping over every row (range(len(...)) also includes the last briefing)
    for i in range(len(df.index)):
        WordCloud_Data = df.iloc[i]
        
        #The file name is the row number without an extension,
        #so it can be converted back with int() later
        plt_title = str(i)
        plt_data = WordCloud_Data['cleaned_words']
        
        #Counting word frequencies
        d = {}
        
        for word in plt_data:
            d[word] = d.get(word, 0) + 1
        
        wordcloud = WordCloud(background_color = 'white', max_words = 50).generate_from_frequencies(d)
        
        #Using a fresh figure for each briefing and closing it after saving
        plt.figure()
        plt.imshow(wordcloud)
        plt.axis("off")
        
        plt.savefig(plt_title, format = 'png', bbox_inches = 'tight')
        plt.close()

# Generating separate dataframes
 
## Colour Coding Legend
Keywords: 'vaccin-', 'mask-', 'pandemic', 'covid-', 'virus'

Note: wild card '*' replaced with '-'

### Green
Green files produce WordClouds that include one or more of the keywords (i.e., highly relevant)

### Red
Red files produce WordClouds that include neither the keywords nor 'cdc', 'fda', or 'health' (i.e., irrelevant)

### Orange
Orange files produce WordClouds that might be from relevant articles

### Yellow
Yellow files produce WordClouds that include words such as 'health' (i.e., partly relevant)

### Blue
Blue files produce nonsensical WordClouds; these need to be cleaned again
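
The green, yellow, and red rules above could be pre-screened automatically before sorting by hand. The sketch below is an assumption-laden illustration, not the procedure actually used: `top_words` and `suggest_colour` are hypothetical helpers, yellow is read as "no keyword, but 'cdc', 'fda', or 'health' is present", and orange and blue are left out because they rely on manual judgement.

```python
from collections import Counter

# Keywords written with '-' in the legend are treated as prefixes.
KEYWORD_PREFIXES = ("vaccin", "mask", "covid")
KEYWORD_EXACT = {"pandemic", "virus"}
PARTLY_RELEVANT = {"cdc", "fda", "health"}


def top_words(cleaned_words, max_words=50):
    """Return the most frequent words, mirroring max_words = 50 in the word clouds."""
    return [word for word, _ in Counter(cleaned_words).most_common(max_words)]


def suggest_colour(words):
    """Suggest 'green', 'yellow', or 'red' for the words shown in one word cloud."""
    if any(w.startswith(KEYWORD_PREFIXES) or w in KEYWORD_EXACT for w in words):
        return "green"
    if any(w in PARTLY_RELEVANT for w in words):
        return "yellow"
    return "red"


# Made-up word lists standing in for the 'cleaned_words' column.
print(suggest_colour(top_words(["vaccinated", "people", "people"])))
print(suggest_colour(top_words(["health", "care", "economy"])))
print(suggest_colour(top_words(["infrastructure", "jobs"])))
```

A suggestion like this would still need a manual check, since a word cloud can look relevant or nonsensical in ways that a keyword match does not capture.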

In [None]:
#Creating DFs of green files only

green_path = '/Users/daisyliu/Desktop/Research/UTEA_covid19_metaphors/notebooks/wordclouds/green'

green_files = [f for f in listdir(green_path) if isfile(join(green_path, f))]

#DataFrame.append was removed in pandas 2.0, so select all rows in one step
#(each file name is a row number of press_briefings)
green_briefings = press_briefings.iloc[[int(file) for file in green_files]].copy()
green_briefings['file_number'] = green_files

#Adding briefing 28 by hand
row_28 = press_briefings.iloc[[28]].copy()
row_28['file_number'] = "28"
green_briefings = pd.concat([green_briefings, row_28])
green_briefings = green_briefings.reset_index()

green_briefings

In [None]:
#Creating DFs of red files only

red_path = '/Users/daisyliu/Desktop/Research/UTEA_covid19_metaphors/notebooks/wordclouds/red'

#Skipping the macOS metadata file .DS_Store if it is present
red_files = [f for f in listdir(red_path) if isfile(join(red_path, f)) and f != '.DS_Store']

#DataFrame.append was removed in pandas 2.0, so select all rows in one step
red_briefings = press_briefings.iloc[[int(file) for file in red_files]].copy()

red_briefings['file_number'] = red_files
red_briefings = red_briefings.reset_index(drop = True)

red_briefings

In [None]:
# Creating DFs of orange files only

orange_path = '/Users/daisyliu/Desktop/Research/UTEA_covid19_metaphors/notebooks/wordclouds/orange'

orange_files = [f for f in listdir(orange_path) if isfile(join(orange_path, f))]

#DataFrame.append was removed in pandas 2.0, so select all rows in one step
orange_briefings = press_briefings.iloc[[int(file) for file in orange_files]].copy()

orange_briefings['file_number'] = orange_files
orange_briefings = orange_briefings.reset_index(drop = True)

orange_briefings

In [None]:
# Creating DFs of yellow files only

yellow_path = '/Users/daisyliu/Desktop/Research/UTEA_covid19_metaphors/notebooks/wordclouds/yellow'

yellow_files = [f for f in listdir(yellow_path) if isfile(join(yellow_path, f))]

#DataFrame.append was removed in pandas 2.0, so select all rows in one step
yellow_briefings = press_briefings.iloc[[int(file) for file in yellow_files]].copy()

yellow_briefings['file_number'] = yellow_files
yellow_briefings = yellow_briefings.reset_index(drop = True)

yellow_briefings

In [None]:
# Creating DFs of blue files only

blue_path = '/Users/daisyliu/Desktop/Research/UTEA_covid19_metaphors/notebooks/wordclouds/blue'

blue_files = [f for f in listdir(blue_path) if isfile(join(blue_path, f))]

#DataFrame.append was removed in pandas 2.0, so select all rows in one step
blue_briefings = press_briefings.iloc[[int(file) for file in blue_files]].copy()

blue_briefings['file_number'] = blue_files
blue_briefings = blue_briefings.reset_index(drop = True)

blue_briefings

# Creating corpus text files of entries from green_briefings df

## DO NOT RUN UNLESS YOU WANT ANOTHER SAMPLE!

green_index = [] #Creating number strings

for n in green_briefings.index:
    green_index.append(str(n))
    
### Creating corpus header
file_name = '{name}'
article_name = '<{article_name}>'
btype = '<{btype}>'
time = '<{time}>'
url = '<{url}>'
    
for row in green_briefings.index:
    
    data = green_briefings.iloc[row]
    #print(data)
    
    row_file_name = file_name.format(name = str(data.file_number))+".txt"
    
    with open(row_file_name, "a") as file:
        file.write(article_name.format(article_name = data.title)+'\n')
        file.write(btype.format(btype = data.btype)+'\n')
        file.write(time.format(time = data.time)+'\n')
        file.write(url.format(url = data.url)+'\n')
        file.write(data.original_text)

### Now generating sample

sample = random.sample(range(0, 175), 87)

#Opening the sample file once instead of once per briefing
with open("sample.txt", "a") as file:
    for s in sample:
        file.write(green_briefings.iloc[s].original_text)