

# S1 - Geosocial Media Data Processing & Language Model Training
 


This notebook is part of the Supplementary Material for the paper
_Mapping indicators of cultural ecosystem services use in urban green spaces based on text classification of geosocial media data_, published in the journal Ecosystem Services: Science, Policy & Practice (https://doi.org/10.1016/j.ecoser.2022.101508). The Supplementary Material comprises the HTML conversions of a series of three Jupyter notebooks:

    1. S1_GSM_Data_Processing&LanguageModelTraining.html
    2. S2_GSM_Data_TextClassification.html
    3. S3_Generate_ChiValueExpectationSurface.html

In this Notebook the following processes are addressed:

    1. Text Normalization of Instagram & Flickr geotagged posts
    2. Creation of geosocial media corpus required to train a word2vec model
    3. Language Model Training - word2vec model training using the Gensim topic modeling toolkit (Řehůřek & Sojka, 2010)
    
**Input data**:
 - collected Instagram and Flickr textual annotations (.csv file)
 - list of stop words (.txt file)

**Output data**:
 - normalized Instagram and Flickr annotations in English and German (.csv file)
 - word2vec language model trained on geosocial media corpus (.model file)

In [5]:
import datetime as dt
from IPython.display import clear_output, display, Markdown
date = dt.date.today()
display(Markdown(f'**Last update: {date}**'))

**Last update: 2023-01-02**

### 1. Load dependencies

 - gensim - version 4.1.2
 - nltk - version 3.6.5
 - pandas - version 1.3.3
 - polyglot - version 16.7.4


In [None]:
import pandas as pd
import string
import itertools
from pathlib import Path

# NLP Tools for text normalization
import ftfy
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.tokenize import TweetTokenizer
from polyglot.detect import Detector
# disable warnings from the polyglot module
import logging
logging.getLogger("polyglot").setLevel(logging.CRITICAL)
import warnings
warnings.filterwarnings('ignore')

# Tools to work with WordEmbeddings
from gensim import utils
from gensim.models import Word2Vec

In [None]:
INPUT = Path.cwd() / '01_Input'
OUTPUT = Path.cwd() / '02_Output'

### 2. Parse the collected geosocial media posts into a pandas dataframe
 - all Flickr and Instagram posts were saved in a single .csv file with an _origin_id_ column to differentiate between the two platforms (we encoded Instagram with the value 1 and Flickr with 2)

In [None]:
input_file = INPUT/'01_FlickrInstagram_Unfiltered.csv'
cols = ['origin_id', 'latitude', 'longitude', 'user_guid', 'post_date','tags','post_title','post_body']
dtypes={'origin_id': str,'latitude': float, 'longitude': float, 'user_guid': str, 'post_date': str,'tags':str,'post_title':str,'post_body':str}

df_dd = pd.read_csv(input_file,usecols=cols, dtype=dtypes,  encoding = "UTF-8")
print(len(df_dd), "GSM posts collected in total for the city of Dresden out of which",
      len(df_dd[df_dd.origin_id=='1']),'were shared on Instagram and', 
      len(df_dd[df_dd.origin_id=='2']),'were shared on Flickr')

### 3. Geosocial media Text Normalization & Corpus Creation

The textual metadata collected from Instagram and Flickr is represented by the post's title, body (caption) and the tags the user added to the post. 
We concatenate the _tags_, _post_title_ and _post_body_ columns in a new column called _post_text_. Each _post_text_ value is treated as one document of the geosocial media corpus that is being created.

In [None]:
df_dd["post_text"] = df_dd["tags"].fillna('') + ' ' + df_dd["post_title"].fillna('') + ' ' + df_dd["post_body"].fillna('')

We normalize the text and detect the language of the geosocial media posts by performing the following tasks:
 - set all words to lowercase
 - remove the mentions (@username)
 - remove all punctuation, including question and exclamation marks
 - remove the URLs, as they do not contain useful information (http:// as well as www.)
 - remove all HTML tags (<>)
 - remove digits
 - reduce repeated characters to at most two (people sometimes repeat characters for added emphasis, e.g. "haaaaaaaappy")
 - select only English and German posts for further analysis
 - remove stop words (English and German stop words and all the words from the SortOutAlways_inStr.txt file, which lists words irrelevant to our analysis, mostly technical terms associated with photo cameras originating from Flickr content)

In [None]:
class Preprocessing():

    @staticmethod
    def fix_bad_unicode (input_text):
        return ftfy.fixes.fix_encoding(input_text)
    
    @staticmethod
    def remove_mentions(input_text):
        return re.sub(r'[^\s]*@[^\s]*','',input_text)
    
    @staticmethod
    def remove_urls(input_text):
        # remove http(s):// links first, then bare www. links
        input_text = re.sub(r'http.?://[^\s]+[\s]?', '', input_text)
        return re.sub(r'www\.\S+', '', input_text)
    
    @staticmethod
    def remove_html_tags(input_text):
        return re.sub(r'<.+>[^<>]*</.>', '',input_text)
    
    @staticmethod
    def remove_letter_numbers_combos (input_text):
        return re.sub(r'[^\s0-9]+[0-9]+', '', input_text)
    
    @staticmethod
    def remove_punctuation(input_text):
        # collapse runs of spaces
        input_text = re.sub(' +',' ',input_text)
        # make translation table ('#' is not listed here; remove_stopwords() strips it later)
        punct = '!"$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
        trantab = str.maketrans(punct, len(punct)*' ')  # every punctuation symbol is replaced by a space
        return input_text.translate(trantab)
    
    @staticmethod
    def remove_digits(input_text):
        return re.sub(r'\d+', '', input_text)
    
    @staticmethod
    def remove_multiple_characters(input_text):
        # keep at most two characters of every run of identical characters
        return ''.join(''.join(s)[:2] for _, s in itertools.groupby(input_text))
    
    @staticmethod
    def lang_detect(input_text):
        # language detection with the polyglot library; keep only German and English posts
        lang_code = Detector(input_text, quiet=True).language.code
        return lang_code in ('de', 'en')
    
    @staticmethod
    def remove_stopwords(input_text, stopwords_list):
        text = input_text.replace('#','')
        words = text.split(' ') 
        clean_words = [word for word in words if (word not in stopwords_list) and len(word) > 1] #remove single characters
        return " ".join(clean_words)
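
To illustrate what some of these normalization steps do to a single post, the following minimal sketch re-implements three of them (mention removal, URL removal and the collapsing of repeated characters) with the standard library only. The sample post, the helper names `collapse_repeats` and `strip_mentions_and_urls`, and the simplified URL patterns are illustrative assumptions and not part of the notebook's pipeline.

```python
import itertools
import re


def collapse_repeats(text, max_run=2):
    """Shorten every run of identical characters to at most max_run characters."""
    return ''.join(''.join(group)[:max_run] for _, group in itertools.groupby(text))


def strip_mentions_and_urls(text):
    """Remove @mentions, http(s):// links and bare www. links."""
    text = re.sub(r'\S*@\S*', '', text)
    text = re.sub(r'https?://\S+', '', text)
    return re.sub(r'www\.\S+', '', text)


# made-up sample post, used only for illustration
sample = 'Soooo haaaaappy @friend at the Elbe!!! https://example.org/p/1 #dresden'

cleaned = strip_mentions_and_urls(sample.lower())
cleaned = collapse_repeats(cleaned)
cleaned = ' '.join(cleaned.split())  # squeeze the leftover whitespace
print(cleaned)  # soo haappy at the elbe!! #dresden
```

Because runs are cut to two characters rather than one, regular double letters such as the "pp" in "happy" survive, while an emphasized "haaaaappy" becomes "haappy" and not "happy".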


In [None]:
#initialize the corpus list
corpus_list = []

#initialize dataframe for the normalized geosocial media posts
normalized_df = pd.DataFrame({'origin_id': [],'latitude':[],'longitude':[], 'user_guid':[],'post_date':[],
                                 'tags':[],'post_title':[],'post_body':[], 'post_text':[]})

#load stopwords lists 
with open(INPUT/'SortOutAlways_inStr.txt', 'r') as f:
    list_out = f.read().splitlines()
punctuation_list= [s for s in string.punctuation]
alphabet_string = string.ascii_lowercase
single_letters = list(alphabet_string)
stopwords_list = stopwords.words('english') + stopwords.words('german') + list_out + punctuation_list + single_letters
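
The effect of the stop word filter can be checked on a toy example. The sketch below mirrors the logic of `Preprocessing.remove_stopwords` in a standalone function; the small `demo_stopwords` set is a hypothetical stand-in for the combined list built above (NLTK English and German stop words, SortOutAlways_inStr.txt, punctuation and single letters). It uses a set because membership tests on a set are faster than on a list, which may matter for large collections of posts.

```python
def remove_stopwords(text, stopwords):
    """Drop hashtag signs, stop words and single-character tokens."""
    words = text.replace('#', '').split()
    return ' '.join(w for w in words if w not in stopwords and len(w) > 1)


# hypothetical mini stop word set, for illustration only
demo_stopwords = frozenset({'the', 'at', 'im', 'der', 'nikon', 'canon'})

filtered = remove_stopwords('sunset at the #elbe nikon d #dresden', demo_stopwords)
print(filtered)  # sunset elbe dresden
```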

In [None]:
%%time
x = 0
total_records = len(df_dd)
msg_text = ''
# collect the normalized posts in a list and build the dataframe once at the end;
# appending to a dataframe row by row is slow and DataFrame.append was removed in pandas 2.0
normalized_rows = []
for index, row in df_dd.iterrows():
    x += 1
    msg_text = (
        f'Processed records: {x} ({x/(total_records/100):.2f}%). ')
    if x % 100 == 0:
        clear_output(wait=True)
        print(msg_text)
    text = row['post_text']
    if len(text) > 2:
        text = Preprocessing.fix_bad_unicode(text)
        text = text.lower()
        text = Preprocessing.remove_mentions(text)
        text = Preprocessing.remove_urls(text)
        text = Preprocessing.remove_html_tags(text)
        text = Preprocessing.remove_letter_numbers_combos(text)
        text = Preprocessing.remove_digits(text)
        text = Preprocessing.remove_punctuation(text)
        text = Preprocessing.remove_multiple_characters(text)
        if Preprocessing.lang_detect(text):
            text = Preprocessing.remove_stopwords(text, stopwords_list)
            if len(text) > 0:
                corpus_list.append(text)
                normalized_rows.append({'origin_id': row['origin_id'],
                                        'latitude': row['latitude'],
                                        'longitude': row['longitude'],
                                        'user_guid': row['user_guid'],
                                        'post_date': row['post_date'],
                                        'tags': row['tags'],
                                        'post_title': row['post_title'],
                                        'post_body': row['post_body'],
                                        'post_text': text})

# build the dataframe once, keeping the column order defined above
normalized_df = pd.DataFrame(normalized_rows, columns=normalized_df.columns)

# final status
clear_output(wait=True)
print(msg_text)

Save the normalized geosocial media posts as a .csv file

In [None]:
# save dataframe as .csv
output_file = OUTPUT/'Normalized_GeosocialMediaData.csv'
normalized_df.to_csv(output_file, encoding= "UTF-8")

Save the corpus as a .txt file

In [None]:
outputCorpus = OUTPUT/'corpus_GeosocialMedia.txt'
# write one normalized post per line
with open(outputCorpus, 'w') as f:
    for item in corpus_list:
        f.write("%s\n" % item)
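
The corpus file holds one normalized post per line, with tokens separated by single spaces. Reading it back line by line and splitting on spaces yields a list of token lists, which is the input format the word2vec training step expects. The round trip can be sketched as follows; the two demo posts, the temporary file and the explicit UTF-8 encoding are assumptions made for this illustration.

```python
import tempfile
from pathlib import Path

# made-up normalized posts, for illustration only
demo_corpus = ['sunset elbe dresden', 'großer garten spaziergang']

with tempfile.TemporaryDirectory() as tmp_dir:
    corpus_path = Path(tmp_dir) / 'demo_corpus.txt'

    # write one post per line
    with open(corpus_path, 'w', encoding='utf-8') as f:
        for post in demo_corpus:
            f.write(post + '\n')

    # read the file back as a list of token lists
    with open(corpus_path, encoding='utf-8') as f:
        sentences = [line.rstrip('\n').split(' ') for line in f]

print(sentences)
```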

### 4. Training the word2vec language model and word embeddings on the previously created geosocial media corpus

In [None]:
#read corpus file
corpusFile = OUTPUT/'corpus_GeosocialMedia.txt'
with open(corpusFile) as f:
    corpus = [line.rstrip('\n').split(' ') for line in f]

#train word2vec model on geosocial media corpus 
model_w2v = Word2Vec(sentences=corpus,
                     corpus_file=None,
                     vector_size=300,  # named `size` before gensim 4.0
                     alpha=0.025, 
                     window=5,
                     min_count=5,
                     max_vocab_size=None,
                     sample=0.001,
                     seed=1,
                     workers=4,
                     min_alpha=0.0001,
                     sg=1,
                     hs=0,
                     negative=5,
                     ns_exponent=0.75,
                     epochs=15,  # named `iter` before gensim 4.0
                     null_word=0,
                     trim_rule=None,
                     sorted_vocab=1,
                     batch_words=10000,
                     compute_loss=False,
                     callbacks=(),
                     max_final_vocab=None)
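
With `min_count=5`, words that occur fewer than five times in the whole corpus are ignored during training and receive no embedding. A rough way to gauge how much of the vocabulary survives a given threshold is to count token frequencies beforehand. The sketch below does this with the standard library on a toy corpus; the helper name `vocabulary_after_min_count`, the toy data and the lower threshold of 2 are illustrative assumptions.

```python
from collections import Counter


def vocabulary_after_min_count(sentences, min_count):
    """Return the tokens (with counts) that occur at least min_count times."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {token: n for token, n in counts.items() if n >= min_count}


# toy corpus in the same list-of-token-lists format as `corpus`
toy_corpus = [['elbe', 'sunset'], ['elbe', 'picnic'], ['elbe', 'sunset', 'walk']]

kept = vocabulary_after_min_count(toy_corpus, min_count=2)
print(sorted(kept))  # ['elbe', 'sunset']
```

Applied to the real `corpus` with `min_count=5`, the size of the returned dictionary gives an estimate of the vocabulary the model is trained on.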

In [None]:
#save model
model_file = OUTPUT/'word2vec_GeosocialMedia.model'
model_w2v.save(str(model_file))  # pass the path as a plain string

#### References

1. Mikolov, T., Chen, K., Corrado, G., Dean, J., 2013. Efficient Estimation of Word Representations in Vector Space. ICLR Workshop. https://doi.org/10.48550/arXiv.1301.3781
2. Řehůřek, R., Sojka, P., 2010. Software Framework for Topic Modelling with Large Corpora. In Proc. of the LREC 2010 Workshop on New Challenges for NLP Frameworks, 45–50. Valletta, Malta: ELRA.
