# S2 - Geosocial Media Data Text Classification



This notebook is part of the Supplementary Material provided for the paper
_Mapping indicators of cultural ecosystem services use in urban green spaces based on text classification of geosocial media data_, published in the Ecosystem Services: Science, Policy & Practice Journal (https://doi.org/10.1016/j.ecoser.2022.101508). The Supplementary Material includes the HTML conversions of a series of three Jupyter notebooks:

    1. S1_GSM_Data_Processing&LanguageModelTraining.html
    2. S2_GSM_Data_TextClassification.html
    3. S3_Generate_ChiValueExpectationSurface.html

This notebook covers the following steps:

    1. Creation of an IDF dictionary for the TF-IDF scores used to compute the post embeddings
    2. Definition of labels and computation of label embeddings
    3. Classification of geosocial media posts related to aesthetic appreciation
    4. Classification of geosocial media posts related to wildlife recreation
 
**Input data**:
 - normalized Instagram and Flickr textual annotations in English and German (.csv file)
 - corpus containing all normalized Instagram and Flickr textual annotations (.txt file)
 - word2vec language model trained on geosocial media corpus (.model file)

**Output data**:
 - classified Instagram and Flickr textual annotations in English and German (.csv file)

In [1]:
import datetime as dt
from IPython.display import clear_output, display, Markdown
date = dt.date.today()
display(Markdown(f'**Last update: {date}**'))

**Last update: 2023-01-02**

## 1. Preparations

### 1.1. Load Dependencies

 - gensim - version 4.1.2
 - numpy - version 1.21.3
 - pandas - version 1.3.3
 - scipy - version 1.7.1

In [None]:
import pandas as pd
import scipy.spatial.distance  # explicit submodule import; needed for scipy.spatial.distance.cosine
import numpy as np
import matplotlib.pyplot as plt

from gensim import utils
from gensim.models import Word2Vec

import sys
from pathlib import Path
from IPython.display import clear_output, Markdown, display

In [None]:
INPUT = Path.cwd() / '01_Input'
OUTPUT = Path.cwd() / '02_Output'

### 1.2. Parse the normalized geosocial media posts into a pandas dataframe

See notebook [S1_GSM_Data_Processing&LanguageModelTraining.ipynb](./S1_GSM_Data_Processing&LanguageModelTraining.ipynb) for details.


In [None]:
gsm_normalized_file = OUTPUT/'Normalized_GeosocialMediaData.csv'
cols = ['origin_id', 'latitude', 'longitude', 'user_guid', 'post_date','tags','post_title','post_body','post_text']
dtypes={'origin_id': str,'latitude': float, 'longitude': float, 'user_guid': str, 'post_date': str,'tags':str,'post_title':str,'post_body':str, 'post_text':str}
df = pd.read_csv(gsm_normalized_file,usecols=cols, dtype=dtypes, encoding = 'utf-8')
print(len(df),'normalized posts for DD')

In [None]:
df.head()

### 1.3. Load the pre-trained Word2Vec model

Load the pre-trained word2vec model using gensim's `Word2Vec` class.
See notebook [S1_GSM_Data_Processing&LanguageModelTraining.ipynb](./S1_GSM_Data_Processing&LanguageModelTraining.ipynb) for details.


In [None]:
lang_model_file = str( OUTPUT /'word2vec_GeosocialMedia.model')
model_w2v = Word2Vec.load(lang_model_file)

## 2. Topic Classification of Social Media Posts

### Workflow

The classification of the geosocial media posts is based on the **cosine similarity** between each **label embedding** and the **post embeddings**, and follows the workflow outlined below:

1. Labels are conceptualized as collections of words that semantically define the topics we are interested in. For each label, a list of relevant keywords in English and German is defined and then extended with the 10 most similar words to each keyword in our geosocial media corpus.

2. Each *label embedding* is computed by averaging the embeddings of its constituent words.

3. _Post embeddings_ are computed for each geosocial media post as the weighted average of the embeddings of its constituent words, with TF-IDF scores as the weights.

4. We calculate the cosine similarity between each post embedding and each label embedding to determine whether the geosocial media post is related to one of the topics considered. (The code works with the _cosine distance_, defined as _1 - cosine similarity_.) The thresholds for the cosine similarity were determined empirically with a method proposed by Orkphol and Yang (2019), which includes the following steps:
    - Step 1: For each topic, we randomly selected a sample dataset of 1,000 geosocial media posts and manually annotated each post for its relevance to each of the two studied topics (binary encoding: 1 = related to the topic, 0 = unrelated to the topic).
    - Step 2: We calculated the cosine similarity scores between each annotated geosocial media post and each of the two label embeddings.
    - Step 3: For each topic, a binary logistic regression model was built on 30% of the sample dataset, with the relevance as the response variable and the cosine similarity as the predictor.
    - Step 4: The threshold for each topic was then estimated by applying the models to obtain predicted probabilities for each geosocial media post in the remaining test datasets. The threshold corresponds to the cutoff value of the cosine similarity at which the predicted probability becomes greater than 0.5.

    Based on the results, we set 0.65 as the discrimination threshold for the cosine similarity (0.35 for the cosine distance) for both topics.
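
The threshold estimation in Steps 3 and 4 can be illustrated with a small, dependency-free sketch. It assumes a logistic model with a single predictor, p = 1 / (1 + exp(-(b0 + b1 * s))), for which the predicted probability crosses 0.5 exactly at s = -b0 / b1. The toy scores and labels, the function names, and the plain gradient-descent fit are made up for illustration, and the train/test split is omitted; this is not the code used for the paper.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(scores, labels, lr=0.5, epochs=20000):
    """Fit p = sigmoid(b0 + b1 * score) by batch gradient descent (illustrative only)."""
    b0 = b1 = 0.0
    n = len(scores)
    for _ in range(epochs):
        grad0 = grad1 = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(b0 + b1 * s) - y  # gradient of the log-loss w.r.t. the linear term
            grad0 += err
            grad1 += err * s
        b0 -= lr * grad0 / n
        b1 -= lr * grad1 / n
    return b0, b1

# hypothetical annotated sample: cosine similarity and manual relevance label (1 = related)
toy_scores = [0.30, 0.40, 0.50, 0.55, 0.60, 0.62, 0.66, 0.70, 0.75, 0.80, 0.90]
toy_labels = [0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

b0, b1 = fit_logistic(toy_scores, toy_labels)

# similarity value at which the predicted probability equals 0.5
threshold = -b0 / b1
print(f"estimated similarity threshold: {threshold:.2f}")
```

Posts with a similarity above the estimated cutoff would be treated as related to the topic; in terms of cosine distance the same cutoff is 1 minus that value.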

### 2.1. Create IDF-Scores Dictionary (needed for the calculation of TF-IDF score)

In [None]:
# load the corpus created to train the word2vec model
corpus_file = OUTPUT/'corpus_GeosocialMedia.txt'

# read the corpus file into a list (one normalized post per line)
with open(corpus_file) as f:
    corpus = [line.rstrip('\n') for line in f]

# create Document Frequency dictionary for all the tokens in the corpus:
# number of posts in which each token occurs at least once
doc_frequency = {}
for tokens in corpus:
    for w in set(tokens.split(' ')):
        doc_frequency[w] = doc_frequency.get(w, 0) + 1

vocabulary = list(doc_frequency)
vocabulary_length = len(vocabulary)

# create dictionary of idf-scores for all the tokens in the vocabulary
# NOTE: the numerator is the vocabulary size, as used throughout this analysis;
# the textbook IDF uses the number of documents, len(corpus), instead
idf_scores = {}
for w in vocabulary:
    dfreq = doc_frequency[w]
    idf_scores[w] = np.log(vocabulary_length/dfreq)
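
To make the two quantities concrete, the following sketch computes document frequencies and IDF scores for a three-post toy corpus. The corpus and variable names are invented for illustration. The sketch shows both the variant used in the cell above (vocabulary size in the numerator) and the textbook variant (number of documents in the numerator), so the difference is visible.

```python
import math

# hypothetical normalized posts, one per entry
toy_corpus = [
    "park sunset beautiful",
    "park dog selfie",
    "sunset park bird",
]

# document frequency: in how many posts does each token occur?
doc_frequency = {}
for doc in toy_corpus:
    for token in set(doc.split(" ")):
        doc_frequency[token] = doc_frequency.get(token, 0) + 1

vocabulary_size = len(doc_frequency)  # distinct tokens in the toy corpus
n_docs = len(toy_corpus)              # number of posts

# variant used in this notebook: log(vocabulary size / document frequency)
idf_notebook = {w: math.log(vocabulary_size / df) for w, df in doc_frequency.items()}

# textbook variant: log(number of documents / document frequency)
idf_textbook = {w: math.log(n_docs / df) for w, df in doc_frequency.items()}

for w in sorted(doc_frequency):
    print(w, doc_frequency[w], round(idf_notebook[w], 3), round(idf_textbook[w], 3))
```

In both variants a token that appears in many posts (here `park`) receives a lower score than a rare one (here `bird`), which is what lets the scores act as down-weights for uninformative words.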

### 2.2. Calculate _label embeddings_ and _post embeddings_

In [None]:
def has_vector_representation(lang_model, text):
    """Word2Vec cannot handle out-of-vocabulary words; to avoid errors we check
    if at least one word of the document is in the word2vec vocabulary"""
    return any(w in lang_model.wv.key_to_index for w in text)

def avg_topic_vector(lang_model, tokens_list):
    """Label embedding: unweighted mean of the embeddings of all in-vocabulary tokens"""
    # remove out-of-vocabulary words
    tokens = [token for token in tokens_list if token in lang_model.wv.key_to_index]
    return np.average(lang_model.wv[tokens], axis=0)

def avg_post_vector(lang_model, tokens_list, idf):
    """Post embedding: TF-IDF-weighted mean of the embeddings of all in-vocabulary tokens"""
    tokens = []
    weights = []
    for token in tokens_list:
        # membership test on the dict itself avoids rebuilding a list for every token
        if token in lang_model.wv.key_to_index:
            tokens.append(token)
            tf = tokens_list.count(token)/len(tokens_list)
            tfidf = tf*idf[token]
            weights.append(tfidf)
    return np.average(lang_model.wv[tokens], weights=weights, axis=0)
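
The weighting can be followed by hand on a toy example. The sketch below replaces the word2vec model with two invented 2-D vectors and invented IDF scores, and uses plain Python instead of NumPy and SciPy. The helper names are hypothetical, but the arithmetic mirrors `avg_post_vector` and the cosine distance used later for classification.

```python
import math

# invented 2-D "embeddings" and IDF scores, standing in for the trained model
toy_vectors = {"sunset": [1.0, 0.0], "park": [0.0, 1.0]}
toy_idf = {"sunset": 2.0, "park": 1.0}

def weighted_post_vector(tokens, vectors, idf):
    """TF-IDF-weighted average of the vectors of all known tokens."""
    known = [t for t in tokens if t in vectors]
    # term frequency is relative to the full post, including unknown words
    weights = [tokens.count(t) / len(tokens) * idf[t] for t in known]
    total = sum(weights)
    dim = len(next(iter(vectors.values())))
    return [
        sum(w * vectors[t][d] for t, w in zip(known, weights)) / total
        for d in range(dim)
    ]

def cosine_distance(u, v):
    """1 - cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

post = ["sunset", "park", "unknownword"]   # the last token is out of vocabulary
post_vec = weighted_post_vector(post, toy_vectors, toy_idf)

label_vec = [1.0, 0.0]                     # hypothetical label embedding
print(post_vec, cosine_distance(post_vec, label_vec))
```

Because `sunset` has twice the IDF score of `park`, the post vector is pulled towards the `sunset` direction, and the out-of-vocabulary token contributes nothing beyond lowering the term frequencies.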

<div class="alert alert-warning" role="alert" style="color: black;"> 
   <details><summary> <b> Definition and computation of label embeddings</b></summary> <br>
Topics:
    <br> 
    <b> 1. Selfies </b>
    <br>
Social media platforms such as Instagram are often used for marketing and self-promotion purposes. Thus, a large volume of the content shared on this platform consists of portrait photography ("selfies") and fashion-related posts. Photography-focused platforms such as Flickr also contain a significant number of portrait and fashion photographs. This content is not relevant for our analysis of aesthetic appreciation and wildlife recreation and therefore needs to be filtered out of our dataset.
    <br>
    <b> 2. Aesthetic appreciation </b> 
    <br>
To identify Flickr and Instagram posts related to aesthetic appreciation, we compiled a list of English and German keywords that describe aesthetic, nature, and landscape photography.
    <br>
    <b> 3. Wildlife recreation </b>
    <br>
To identify geosocial media posts related to wildlife observation and photography, we compiled a list of words describing fauna and flora. As a disambiguation step, we excluded from the topic definition any words related to zoo animals, pets, or animal tattoos.
    <br>
   </details>
</div>

In [None]:
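# this function creates an extended list of terms for each label: every keyword is
# complemented with its most similar words in the model (gensim returns the top 10 by default)
def expand_topic_list(topic_li):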
    enhanced_list = []
    for keyword in topic_li:
        similar_words = model_w2v.wv.most_similar(positive = [keyword])
        enhanced_list += [w[0] for w in similar_words]
    enhanced_list += topic_li
    return set(enhanced_list)

#used for the disambiguation step
def difference(lst1, lst2):
    lst3 = [value for value in lst1 if value not in lst2] 
    return lst3

#selfie embedding

selfie_list = ['selfie','portrait','porträt','girl','boy','mädchen','cosmetic',
               'kosmetik','makeup','beauty','model','fashion']
selfie = expand_topic_list(selfie_list)
selfie_embedding = avg_topic_vector(model_w2v,selfie)

#aesthetic embedding

aesthetic_list = ['aesthetic','beautiful','breathtaking','brilliant','enchanting','enjoying','gorgeous',
                  'landscape', 'magnificent', 'nature', 'outdoor', 'outstanding','panorama','pretty',
                  'scenery','scenic','splendid', 'ausblick', 'ansicht', 'aussicht','bezaubernd','genießen',
                  'großartig', 'kulturlandschaft', 'landschaft', 'landschaftlich','natur','prachtvoll','prächtig',
                  'rundblick','toll','überwältigend']
aesthetic = expand_topic_list(aesthetic_list)
aesthetic_embedding = avg_topic_vector(model_w2v,aesthetic)

#wildlife embedding

wildlife_list = ['animal','bird','butterfly','fauna','flora','flower', 'fungus', 'insect', 'mushroom', 'plant', 
                 'reptile','tree','wild','wildlife','baum','blume','insekt','kerbtier','pflanze','pilz',
                 'schmetterling','tiere','tierwelt','vogel']
wildlife_disambiguation = ['pet','haustier','zoo','tiergarten','tierpark','zoologischergarten','tattoo','dog',
                              'hund','cat','katze']
wildlife = difference(expand_topic_list(wildlife_list),expand_topic_list(wildlife_disambiguation))
wildlife_embedding = avg_topic_vector(model_w2v,wildlife)
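
The expansion and disambiguation logic can be tried without the trained model. In the sketch below, a hand-written dictionary of invented "nearest neighbours" stands in for `model_w2v.wv.most_similar`, and the function name `expand` is hypothetical. It shows how the exclusion list removes ambiguous terms from the expanded label.

```python
# invented neighbour lists standing in for the word2vec similarity query
toy_neighbours = {
    "bird": ["sparrow", "parrot"],
    "animal": ["fauna", "pet"],
    "pet": ["parrot", "dog"],
    "zoo": ["tierpark", "aquarium"],
}

def expand(keywords, neighbours):
    """Return the keywords together with the neighbours of each keyword."""
    expanded = set(keywords)
    for keyword in keywords:
        expanded.update(neighbours.get(keyword, []))
    return expanded

wildlife_terms = expand(["bird", "animal"], toy_neighbours)
excluded_terms = expand(["pet", "zoo"], toy_neighbours)

# disambiguation: keep only wildlife terms that are not in the exclusion set
wildlife_label = sorted(w for w in wildlife_terms if w not in excluded_terms)
print(wildlife_label)
```

Here `parrot` and `pet` are dropped because they also appear in the expanded exclusion set. The same mechanism can remove a seed keyword if it happens to be a neighbour of an excluded term, so the resulting word list is worth inspecting before the label embedding is computed.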

### 2.3. Classify the geosocial media posts and save the results

In [None]:
# add new columns to the dataframe to save the classification results
# (reindex returns a new dataframe, so the result has to be assigned back)
df = df.reindex(df.columns.tolist() + ['selfie','cos_dist_selfie',
                                       'aesthetic','cos_dist_aesthetic',
                                       'wildlife', 'cos_dist_wildlife'], axis=1)
df.head()

In [None]:
%%time

x = 0
total_records = len(df)

for index, row in df.iterrows():
    x+=1
    msg_text = (
        f'Processed records: {x} ({x/(total_records/100):.2f}%). ')
    if x % 100 == 0:
        clear_output(wait=True)
        print(msg_text)
        
    text = row['post_text'].split(' ')
    # posts without any in-vocabulary word cannot be embedded and stay unclassified
    if has_vector_representation(model_w2v, text):
        post_embedding = avg_post_vector(model_w2v, text, idf_scores)
        cos_dist_selfie = scipy.spatial.distance.cosine(selfie_embedding, post_embedding)
        cos_dist_aesthetic = scipy.spatial.distance.cosine(aesthetic_embedding, post_embedding)
        cos_dist_wildlife = scipy.spatial.distance.cosine(wildlife_embedding, post_embedding)

        # selfie filter: cosine distance below 0.3
        is_selfie = cos_dist_selfie<0.3
        df.at[index,'selfie'] = 1 if is_selfie else 0
        df.at[index,'cos_dist_selfie'] = cos_dist_selfie

        # aesthetic appreciation: cosine distance below 0.35;
        # matching posts that are also flagged as selfies get the code 2 instead of 1
        if cos_dist_aesthetic<0.35:
            df.at[index,'aesthetic'] = 2 if is_selfie else 1
        else:
            df.at[index,'aesthetic'] = 0
        df.at[index,'cos_dist_aesthetic'] = cos_dist_aesthetic

        # wildlife recreation: cosine distance below 0.35
        df.at[index,'wildlife'] = 1 if cos_dist_wildlife<0.35 else 0
        df.at[index,'cos_dist_wildlife'] = cos_dist_wildlife
# final status
clear_output(wait=True)
print(msg_text)
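
The decision rule applied to each post can be isolated in a small function, which makes the thresholds and the label codes easy to check. This is a sketch: the function and constant names are hypothetical, while the cutoffs (0.3 for selfies, 0.35 for the two topics) and the codes are taken from the loop above.

```python
SELFIE_MAX_DIST = 0.30   # cosine distance cutoff for the selfie filter
TOPIC_MAX_DIST = 0.35    # cosine distance cutoff for aesthetic and wildlife

def classify_post(cos_dist_selfie, cos_dist_aesthetic, cos_dist_wildlife):
    """Map the three cosine distances of one post to the label codes."""
    is_selfie = cos_dist_selfie < SELFIE_MAX_DIST

    aesthetic = 0
    if cos_dist_aesthetic < TOPIC_MAX_DIST:
        # code 2 marks aesthetic matches that are also flagged as selfies
        aesthetic = 2 if is_selfie else 1

    wildlife = 1 if cos_dist_wildlife < TOPIC_MAX_DIST else 0

    return {"selfie": int(is_selfie), "aesthetic": aesthetic, "wildlife": wildlife}

print(classify_post(0.50, 0.20, 0.60))  # aesthetic post, not a selfie
print(classify_post(0.10, 0.20, 0.30))  # selfie that also matches both topics
```

Both comparisons are strict, so a post whose distance equals the cutoff exactly is not assigned to the topic.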

In [None]:
# save df as csv file
df.to_csv(OUTPUT / 'DD_Flickr&Instagram_Classified.csv')

#### References:

1. Orkphol, K., & Yang, W. (2019). Word Sense Disambiguation Using Cosine Similarity Collaborates with Word2vec and WordNet. Future Internet, 11(5), 114. https://doi.org/10.3390/fi11050114

2. The code used to determine the cosine similarity thresholds is an adaptation of the notebook provided by the authors of the method, which can be found at:
https://anaconda.org/korawit/similarity_threshold/notebook