# Convert altmetric dump to CSV 
This Jupyter notebook contains the code used to parse more than 100 GB of data from the Altmetric June 2018 dataset.
The goal is to extract only the necessary information from this large dump, shrinking it to around 1.5 GB.

In [None]:
import os
import numpy as np
import csv
import json

The __parse_file__ function takes the path of a file as input, opens it, and reads it line by line, where each line holds the JSON for a single Altmetric record. The parser extracts the following features from each record:
- __altmetric_id:__  unique identifier of the article
- __title:__  title of the article
- __subjects:__  list of subjects of the article
- __scopus_subjects:__  list of Scopus subjects of the article
- __publisher_subjects:__  name of the first publisher subject of the article
- __pubdate:__  publication date of the article
- __abstract:__  article abstract
- __fb_wall_count:__  number of Facebook posts that shared the article
- __fb_wall_urls:__  list of URLs of the public posts shared by public pages

Only records that have Facebook posts are kept. Because the dataset is incomplete, not every article has every feature, so each feature is checked and its value is set to numpy.nan when it is missing.

In [None]:
def parse_file(filepath):
    result = []

    #open the file
    with open(filepath) as f:
        #list of citation fields we want
        citationFields = ['title','subjects','scopus_subjects','pubdate','abstract']
        #for each altmetric in the file
        for line in f:
            #we load the json data
            data = json.loads(line)
            #we check if we have posts
            if(len(data['posts'])):
                if 'facebook' in data['posts']: #if we have facebook posts
                    content = {}
                    article_urls=[]
                    
                    for post in data['posts']['facebook']:  #we get the links
                        article_urls.append(post['url'])

                    content['altmetric_id'] = data['altmetric_id'] #altmetricID


                    for feature in citationFields:
                        if feature in data['citation']:
                            content[feature] = data['citation'][feature]  #copy the citation field as it is
                        else:
                            content[feature] = np.nan

                    if 'publisher_subjects' in data['citation']:
                        content['publisher_subjects'] = data['citation']['publisher_subjects'][0]['name']  #publisher subject
                    else:
                        content['publisher_subjects'] = np.nan
                    
                    content['fb_wall_count'] = len(data['posts']['facebook']) #number of facebook posts

                    content['fb_wall_urls'] = article_urls

                    result.append(content)

    return result
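
The per-record handling described above can be sketched on a single JSON line. This is a minimal illustration, not the notebook's code: the sample record, the constant `CITATION_FIELDS` and the helper `extract_record` are hypothetical, `float("nan")` stands in for numpy.nan so that only the standard library is needed, and records with an empty Facebook list are skipped here.

```python
import json
import math

# Hypothetical sample record; real Altmetric records carry many more keys.
SAMPLE_LINE = json.dumps({
    "altmetric_id": 42,
    "citation": {"title": "An example article", "pubdate": "2018-01-01"},
    "posts": {"facebook": [{"url": "https://www.facebook.com/somepage/posts/123"}]},
})

CITATION_FIELDS = ["title", "subjects", "scopus_subjects", "pubdate", "abstract"]


def extract_record(line):
    """Return a flat dict for one JSON line, or None when it has no Facebook posts."""
    data = json.loads(line)
    posts = data.get("posts") or {}  # an empty "posts" value is treated as no posts
    facebook_posts = posts.get("facebook")
    if not facebook_posts:
        return None
    citation = data.get("citation", {})
    record = {"altmetric_id": data["altmetric_id"]}
    for field in CITATION_FIELDS:
        # float("nan") plays the role of numpy.nan for missing features
        record[field] = citation.get(field, float("nan"))
    record["fb_wall_count"] = len(facebook_posts)
    record["fb_wall_urls"] = [post["url"] for post in facebook_posts]
    return record


record = extract_record(SAMPLE_LINE)
print(record["title"], record["fb_wall_count"], math.isnan(record["abstract"]))
```

Using `dict.get` with a default keeps the missing-feature check to one line per field.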


The __create_dataset__ function walks over all the files in the Altmetric dump and calls __parse_file__ on each one to extract the values needed.
Once __parse_file__ returns, __create_dataset__ writes the extracted values to a CSV file.

In [1]:
def create_dataset(dataset_file,data_folder,extension):
    #open the csv file in writing mode
    with open(dataset_file, 'w', newline='') as fd:  #newline='' is what the csv module recommends for files it writes
        csvwriter = csv.writer(fd)
        #write the header to file
        header = ["altmetric_id", "title", 'subjects','scopus_subjects',"publisher_subjects","abstract",'pubdate' ,"fb_wall_count","fb_wall_urls"]
        csvwriter.writerow(header)
        
        #get the number of directories we need to scan to show a progress percentage
        numOfDIrs = sum(os.path.isdir(data_folder+'/'+i) for i in os.listdir(data_folder))
        currDirNum=1
        #loop over all the directories
        for dirpath, dirnames, files in os.walk(data_folder):
            print("Analyzing directory {}/{}: {}".format(currDirNum,numOfDIrs,dirpath))
            currDirNum+=1
            #get all the .txt files
            for name in files:
                if extension and name.lower().endswith(extension):

                    file = os.path.join(dirpath, name)
                    
                    #parse each file and then write it in the csv
                    for article in parse_file(file):
                        current_row=[]
                        for column in header:
                            if column in article:
                                current_row.append(article[column])
                            else:
                                current_row.append(np.nan)
                        csvwriter.writerow(current_row)
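
The directory walk can be tried in isolation on a throwaway tree. The sketch below is an assumption-laden illustration: `find_files` and `count_directories` are hypothetical helpers, not part of the notebook. It also shows one way to make the progress total match what `os.walk` visits, since `os.walk` yields the root folder and every nested folder, while the counter in create_dataset only counts the immediate subdirectories.

```python
import os
import tempfile


def find_files(root, extension):
    """Yield paths under root whose names end with extension (case-insensitive)."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if name.lower().endswith(extension):
                yield os.path.join(dirpath, name)


def count_directories(root):
    """Count every directory os.walk will visit, including root itself."""
    return sum(1 for _ in os.walk(root))


# Build a small temporary tree to demonstrate the two helpers.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "a", "b"))
    for rel in ("a/one.txt", "a/b/two.TXT", "a/b/skip.json"):
        with open(os.path.join(root, *rel.split("/")), "w") as handle:
            handle.write("{}\n")
    found = [os.path.relpath(path, root) for path in find_files(root, ".txt")]
    total_dirs = count_directories(root)

print(found, total_dirs)
```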


Call used to start the whole parsing process

In [None]:
create_dataset('/media/mfattoru/Backup Data/altmetric_Dataset/extracted_dataset/altmetrics_dataset.csv',
               '/media/mfattoru/Backup Data/altmetric_Dataset/keys',
               '.txt')

# Fetch reaction information from Facebook

We import all the libraries needed to fetch the reactions and define the class bcolors, which is used to color the terminal output. The colors make it easier to spot errors while the script runs in the background.

In [1]:
from urllib.parse import urlparse
import pandas as pd
import numpy as np
import requests
import csv
import sys
from time import sleep
from pathlib import Path
import os
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import datetime
from random import randint

startSleep = 14
endSleep = 18

class bcolors:
    HEADER = '\033[95m'
    OKBLUE = '\033[94m'
    OKGREEN = '\033[92m'
    WARNING = '\033[93m'
    FAIL = '\033[91m'
    ENDC = '\033[0m'
    BOLD = '\033[1m'
    UNDERLINE = '\033[4m'

The function queryGraphApi performs a query against the Facebook Graph API.
It takes a fully built Graph API query URL as input, executes it, and returns the response. If the API answers with status 400 (the page went private), it returns the integer 400 instead of a response object.
Because our API key is limited in the number of queries it can execute each hour, the function sleeps after every query so that the limit is not exceeded.
If the limit is nearly reached anyway, for example because the API key is also used elsewhere, the function sleeps longer to bring the usage back down.

In [None]:
import json

def queryGraphApi(url):
    responseCode = 0
    tries = 1
    while(responseCode !=200 ):
        print("Try fetch number {}".format(tries))
        response = requests.get(url)
        responseCode = response.status_code
        tries+=1
        sleepTime=20
        #the X-App-Usage header holds a JSON object, so we parse it instead of calling eval on remote data
        callCount=json.loads(response.headers['X-App-Usage'])['call_count']

        if response.status_code == 400:  #the page went private
            response=400
            break
        if response.status_code != 200:
            
            print( bcolors.FAIL + 'ERROR' + bcolors.ENDC + ': API response: ' + bcolors.FAIL+ '{}'.format(response.status_code) + bcolors.ENDC)
            
            print( bcolors.WARNING + "WARNING" + bcolors.ENDC+": You used already {}/100 calls percentage!".format(callCount))
            if(callCount > 98):
                sleepTime = 2*endSleep*(callCount-95)
            else:
                sleepTime = randint(2*startSleep,2*endSleep)

            print( bcolors.OKBLUE + "INFO" + bcolors.ENDC + ": Sleeping {} seconds before retrying...".format(sleepTime))
            sleep(sleepTime)
    print( bcolors.OKBLUE + "INFO" + bcolors.ENDC + ": You used {}/100 calls percentage!".format(callCount))
    if(callCount > 98):
        sleepTime = 2*endSleep*(callCount-95)
        print( bcolors.WARNING + "WARNING" + bcolors.ENDC + ": Sleeping {} seconds to lower the limit...".format(sleepTime))
    else:
        sleepTime = randint(startSleep,endSleep)
        print( bcolors.OKBLUE + "INFO" + bcolors.ENDC + ": Sleeping {} seconds to keep the limit...".format(sleepTime))
    sleep(sleepTime)
    return response
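
The sleep policy can be separated from the network call and tested on its own. The following is a sketch under stated assumptions: `parse_call_count` and `throttle_delay` are hypothetical names, the header value is a made-up sample, and the thresholds are copied from queryGraphApi.

```python
import json
from random import randint

START_SLEEP = 14  # same values as startSleep / endSleep in the notebook
END_SLEEP = 18


def parse_call_count(header_value):
    """Read call_count from an X-App-Usage header value, assumed to be a JSON object."""
    return json.loads(header_value)["call_count"]


def throttle_delay(call_count, retrying=False):
    """Seconds to sleep after a request, mirroring the thresholds in queryGraphApi."""
    if call_count > 98:
        # Close to the quota: back off in proportion to how far past 95% we are.
        return 2 * END_SLEEP * (call_count - 95)
    if retrying:
        # A failed request waits twice as long as a successful one.
        return randint(2 * START_SLEEP, 2 * END_SLEEP)
    return randint(START_SLEEP, END_SLEEP)


# Made-up header value for illustration only.
usage = parse_call_count('{"call_count": 99, "total_time": 10, "total_cputime": 10}')
print(throttle_delay(usage))
```

Keeping the delay computation free of `sleep` and `requests` calls makes it possible to check the thresholds without touching the API.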

The function cut_files takes as input:
- __filename:__ location of the input file
- __numOfLines:__ number of data rows to remove from the CSV file (the header row is kept)
- __fd:__ the open file object of the output dataset, which is closed first so that everything written so far is saved

This function removes the first numOfLines data rows from the input file, after creating a backup copy of the dataset (filename + ".bkp").
It is called when CTRL + C is pressed during the execution of the main script, so that the progress made up to that point is saved and the script can later resume from where it left off. The main script also calls it when it fails with an exception and when it completes.

In [None]:
def cut_files(filename,numOfLines,fd):
    print( bcolors.WARNING + '\nYou pressed Ctrl+C!, I\'m deleting lines from the csv, be patient' + bcolors.ENDC)
    fd.close()
    counter=0
    with open(filename,'r+') as csvfile:
        with open(filename+".bkp", 'w') as bkpfile:
            with open(filename+".tmp", 'w') as tmpfile:
                bkpWriter = csv.writer(bkpfile)
                csvWriter = csv.writer(tmpfile)
                reader = csv.reader(csvfile)
                for row in reader:
                    bkpWriter.writerow(row)
                    if(counter==0 or counter > numOfLines):
                        csvWriter.writerow(row)
                    counter+=1
    os.remove(filename)
    os.rename(filename+".tmp",filename)
    print( bcolors.OKGREEN + "I'm done, deleted {} rows, enjoy!".format(numOfLines) + bcolors.ENDC)
    sys.exit(0)
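
The row-trimming step can be exercised on a small temporary file. This is a minimal sketch rather than the notebook's function: `drop_processed_rows` is a hypothetical helper that skips the `sys.exit` call, copies the backup with `shutil.copyfile`, and swaps the trimmed file in with `os.replace`.

```python
import csv
import os
import shutil
import tempfile


def drop_processed_rows(path, num_rows):
    """Keep the header, drop the first num_rows data rows, and leave a .bkp copy."""
    shutil.copyfile(path, path + ".bkp")  # backup before touching anything
    tmp_path = path + ".tmp"
    with open(path, newline="") as src, open(tmp_path, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        for index, row in enumerate(reader):
            # index 0 is the header; data rows 1..num_rows are already processed
            if index == 0 or index > num_rows:
                writer.writerow(row)
    os.replace(tmp_path, path)  # swap the trimmed file into place


# Demonstration on a throwaway file with a header and four data rows.
workdir = tempfile.mkdtemp()
demo = os.path.join(workdir, "input.csv")
with open(demo, "w", newline="") as handle:
    csv.writer(handle).writerows([["id"], ["1"], ["2"], ["3"], ["4"]])

drop_processed_rows(demo, 2)
with open(demo, newline="") as handle:
    remaining = list(csv.reader(handle))
print(remaining)
shutil.rmtree(workdir)
```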

The function parsePage takes the pageID of a Facebook page as input and returns the number of likes and followers of that page.
For some pages the returned HTML does not contain the needed values, possibly because Facebook refuses to serve too many requests in a short amount of time. In that case the function raises an exception, which is handled by its caller.

In [None]:
def parsePage(pageId):
    pageCode = 0
    while pageCode != 200:
        pageHtml = urlopen("https://www.facebook.com/"+pageId)
        pageCode = pageHtml.getcode()
        if pageCode != 200:
            print( bcolors.FAIL + "ERROR" + bcolors.ENDC + ": Page parsing status: "+ bcolors.FAIL + "{}".format(pageCode) + bcolors.ENDC)
            sleep(randint(45,70))  #randint is the name imported from random
            continue
        else:
            print( bcolors.OKBLUE + "INFO" + bcolors.ENDC + ": parsed page status: " + bcolors.OKGREEN + "{}".format(pageCode) + bcolors.ENDC)
            soup = BeautifulSoup(pageHtml, 'html5lib')
            soupSelect = soup.select('#pages_side_column ._4bl9 div')

            pagelikes = soup.find(id="PagesLikesCountDOMID").contents[0].contents[0]
            pagelikes = int(str(pagelikes).replace(",",""))
            if(pagelikes == 0):
                pagelikes = soupSelect[0].contents[0]
                pagelikes = str(pagelikes).replace(",","")
                pagelikes=re.findall(r'\d+',pagelikes)
                pagelikes = int(pagelikes[0])

            pagefollowers = soupSelect[1].contents[0]
            pagefollowers = str(pagefollowers).replace(",","")
            pagefollowers=re.findall(r'\d+',pagefollowers)[0]
            pagefollowers=int(pagefollowers)

            print( bcolors.OKGREEN + "OK" + bcolors.ENDC + ": Found {} likes and {} followers!".format(pagelikes,pagefollowers))
            return pagelikes,pagefollowers
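
The number extraction that parsePage applies to the page text can be reduced to one small helper. The sketch below assumes the text looks like the made-up sample strings; `first_int` is a hypothetical name, and the actual Facebook markup is not reproduced here.

```python
import re


def first_int(text):
    """Return the first integer in text, ignoring thousands separators, or None."""
    match = re.search(r"\d+", text.replace(",", ""))
    return int(match.group()) if match else None


print(first_int("12,345 people follow this"))
print(first_int("no digits here"))
```

Returning None instead of raising on a missing number lets the caller decide how to treat pages whose HTML lacks the counters.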

This is the main entry point of the reaction fetcher. It opens an input CSV file, builds a request to the Facebook Graph API for each post, fetches the reaction data and the number of followers of each page, and writes everything to a dataset in CSV format.
The program can be interrupted by pressing CTRL + C (not sure if it's the same signal sent by the stop button in a Jupyter notebook); it then gracefully saves all the results obtained up to that moment and deletes the already parsed lines from the input_file dataset.
Restarting the process resumes it from where it left off, appending the results to the same dataset.

In [None]:
if __name__ == '__main__':

    output_file = "/media/mfattoru/Backup Data/altmetric_Dataset/extracted_dataset/output_1_shares.csv"
    access_token = "" #private
    graphurl = "https://graph.facebook.com/v2.2/"
    fields = "?fields=updated_time,message,name,caption,description,shares,reactions.type(LIKE).limit(0).summary(total_count).as(like)%2Creactions.type(LOVE).limit(0).summary(total_count).as(love)%2Creactions.type(WOW).limit(0).summary(total_count).as(wow)%2Creactions.type(HAHA).limit(0).summary(total_count).as(haha)%2Creactions.type(SAD).limit(0).summary(total_count).as(sad)%2Creactions.type(ANGRY).limit(0).summary(total_count).as(angry)"
    parameters = "&access_token={}".format(access_token)
    parsedLines = 0

    input_file = "/media/mfattoru/Backup Data/altmetric_Dataset/extracted_dataset/output_1.csv"

    #append if the output file already exists, otherwise create it
    openOperation = 'a' if os.path.exists(output_file) else 'w'

    with open(output_file, openOperation) as fd:
        csvwriter = csv.writer(fd)

        header = ["altmetric_id", "title", 'subjects',"abstract",'pubdate' ,"fb_wall_count",'scopus_subjects','publisher_subjects',"fb_wall_urls"]
        emotions = ['like','love','wow','haha','sad','angry']
        if openOperation == 'w':
            print( bcolors.OKBLUE + "INFO" + bcolors.ENDC + ": File didn't exist before, so writing down the header")
            csvwriter.writerow(header+["shares","visibility"]+["total_" + s for s in emotions])


        df =  pd.read_csv(input_file)
        try:
            for i, row in df.iterrows():
                linkEmotionList=[]
                current_row=[]
                totalEmotions={}
                visibility = {}
                shares = 0
                numOfPages = 0
                missingDataPages = 0

                for link in eval(row['fb_wall_urls']):
                    print( bcolors.OKBLUE + "INFO" + bcolors.ENDC + ": Parsing and requesting info about: "+link )
                    dataDict = {'link':link}
                    # visibility = {}
                    linkData = urlparse(link)
                    if(linkData.query!=''):
                        story=linkData.query.split("&")[0]
                        storyId=story.split("=")[1]

                        pageId=linkData.query.split("&id=")[1]
                    elif(linkData.path!=''):
                        storyId = linkData.path.split("/")[3]
                        pageId = linkData.path.split("/")[1]
                    node = pageId + "_" + storyId
                    base_url = graphurl + node + fields + parameters

                    fb_data = queryGraphApi(base_url)
                    if(fb_data == 400):
                        print( bcolors.FAIL + "ERROR" + bcolors.ENDC + ": The page went private!!")
                        continue


                    if 'shares' in fb_data.json():
                        shares += fb_data.json()['shares']['count']

                    for emotion in emotions:
                        if emotion in fb_data.json():

                            emotionValue = fb_data.json()[emotion]['summary']['total_count']
                            dataDict[emotion] = emotionValue

                            if emotion in totalEmotions:
                                totalEmotions[emotion]+=emotionValue
                            else:
                                totalEmotions[emotion]=emotionValue
                        else:
                            dataDict[emotion]=np.nan

                    try:
                        likes,followers = parsePage(pageId)
                        numOfPages+=1
                        if pageId in visibility:
                            visibility[pageId]+=followers
                        else:
                            visibility[pageId]=followers
                    except:
                        print( bcolors.FAIL + "ERROR: Unable to get likes and followers!" + bcolors.ENDC )
                        missingDataPages+=1
                        likes = np.nan
                        followers = np.nan

                    dataDict['page_likes'] = likes
                    dataDict['page_followers'] = followers

                    if 'message' in fb_data.json():
                        dataDict['message'] = fb_data.json()['message']
                        dataDict['has_text'] = 1
                    else:
                        dataDict['message'] = np.nan
                        dataDict['has_text'] = 0

                    linkEmotionList.append(dataDict)

                if(len(linkEmotionList) != 0):
                    for column in header[:-1]: #we don't save the url, as we are going to overwrite it with the list of dictionary
                        if column in row:
                            current_row.append(row[column])
                        else:
                            current_row.append(np.nan)

                    total_visibility = 0
                    for key,value in visibility.items():
                        total_visibility += value

                    if(numOfPages == 0):
                        numOfPages=1
                    missingFollowers = (total_visibility//numOfPages)*missingDataPages
                    if(missingFollowers > 0):
                        print( bcolors.WARNING + "WARNING" + bcolors.ENDC+": Adding {} followers as {} pages were missing!".format(missingFollowers,missingDataPages))
                        total_visibility += missingFollowers

                    current_row.append(linkEmotionList)
                    current_row.append(shares)
                    current_row.append(total_visibility)


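                    #write the totals in the same order as the header columns, nan if an emotion was never returned
                    for emotion in emotions:
                        current_row.append(totalEmotions.get(emotion, np.nan))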
                    csvwriter.writerow(current_row)
                    parsedLines+=1
                    print( bcolors.OKGREEN + "[ {} ] parsed line {}".format(datetime.datetime.now(),parsedLines) + bcolors.ENDC)
                else:
                    print( bcolors.WARNING + "[ {} ] skipped line {}".format(datetime.datetime.now(),parsedLines) + bcolors.ENDC)
                # sleep(18)
        except KeyboardInterrupt:
            cut_files(input_file,parsedLines,fd)
        except Exception as e:
            print ( bcolors.FAIL + "ERROR" + bcolors.ENDC + ": Failed randomly: "+str(e))
            cut_files(input_file,parsedLines,fd)
    print( bcolors.OKGREEN + "[ {} ] Completed the task".format(datetime.datetime.now()) + bcolors.ENDC)
    cut_files(input_file,parsedLines,fd)
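
The URL handling inside the loop above can be isolated into a single function that builds the `<pageId>_<storyId>` node. This is a sketch under assumptions: `graph_node_id` is a hypothetical helper, both URLs are made-up examples of the two shapes the script handles, and `parse_qsl` replaces the manual string splitting.

```python
from urllib.parse import urlparse, parse_qsl


def graph_node_id(link):
    """Build the '<pageId>_<storyId>' node used in Graph API requests from a post URL."""
    parts = urlparse(link)
    if parts.query:
        params = parse_qsl(parts.query)
        story_id = params[0][1]        # the first query parameter holds the story id
        page_id = dict(params)["id"]   # the 'id' parameter holds the page id
    else:
        segments = parts.path.split("/")  # ['', page, 'posts', story]
        page_id, story_id = segments[1], segments[3]
    return page_id + "_" + story_id


# Both URLs are made-up examples of the two shapes handled by the script.
print(graph_node_id("https://www.facebook.com/permalink.php?story_fbid=111&id=222"))
print(graph_node_id("https://www.facebook.com/somepage/posts/333"))
```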


# Restore Dataset Integrity
Sometimes the scraper that fetches the number of followers of a page gets blocked because the timeout set between queries is too small.
This piece of code fixes all the Altmetric records where the number of followers is zero or nan: it refetches the information from Facebook and then saves everything in a new dataset.

Importing the necessary libraries. Apart from nan, they should already be imported if you executed the previous import cells.

In [3]:
from urllib.parse import urlparse
import pandas as pd
import numpy as np
import requests
import csv
import sys
from time import sleep
from pathlib import Path
import os
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import datetime
from random import randint
from numpy import nan

A modified version of the parsePage function, which handles the error of not finding the values and returns -1 instead of raising an error.

In [None]:
def parsePage(pageId):
    pageCode = 0
    tries = 1
    while pageCode != 200:
        try:
            pageHtml = urlopen("https://www.facebook.com/"+pageId)
            pageCode = pageHtml.getcode()
            print("INFO: TRY {} - parsed page status: {}".format(tries,pageCode))
            if pageCode != 200:
                print("ERROR: Page parsing status {}".format(pageCode))
                sleep(randint(45,70))
                continue
            else:
                soup = BeautifulSoup(pageHtml, 'html5lib')
                soupSelect = soup.select('#pages_side_column ._4bl9 div')
                try:
                    # temp = soup.find(id="PagesLikesCountDOMID").contents[0].contents[0]
                    pagelikes = soup.find(id="PagesLikesCountDOMID").contents[0].contents[0]
                    pagelikes = int(str(pagelikes).replace(",",""))
                    if(pagelikes == 0):
                        pagelikes = soupSelect[0].contents[0]
                        pagelikes = str(pagelikes).replace(",","")
                        pagelikes=re.findall(r'\d+',pagelikes)
                        pagelikes = int(pagelikes[0])

                    pagefollowers = soupSelect[1].contents[0]
                    pagefollowers = str(pagefollowers).replace(",","")
                    pagefollowers=re.findall(r'\d+',pagefollowers)[0]
                    pagefollowers=int(pagefollowers)
                except: #fall back to -1 if we can't find the values
                    pagelikes=-1
                    pagefollowers=-1

                print("Found {} likes and {} followers!".format(pagelikes,pagefollowers))
                return pagelikes,pagefollowers
        except:
            if tries > 5:
                return -1,-1
            else:
                pageCode = 404
                tries+=1
                sleepTime=randint(25,36)
                print("ERROR: Sleeping {} before trying again :X".format(sleepTime))
                sleep(sleepTime)


The fix_visibility function is used to fix each row of a dataframe. It checks whether the visibility column, which contains the total number of followers of all the pages that shared the research article, is zero or nan, which is clearly an error, and if so queries Facebook again for the necessary information.
One of the pages that shared the paper might not be available, perhaps because it went private, so we count the pages for which we are unable to get information and replace each of their values with the mean of the pages we were able to get data from.

In [None]:
def fix_visibility(row):
    if pd.isna(row['visibility']) or row['visibility'] == 0:  #read_csv loads missing values as float nan, not the string 'nan'
        print("MATCH ID {} - VISIBILITY: {}".format(row['altmetric_id'],row['visibility']))
        visibility = 0
        missingDataPages = 0
        numOfPages = 0
#         likes=0
        for struct in eval(row['fb_wall_urls']):
            followers=0
            
            link = struct['link']
            print("Parsing and requesting info about: "+link)
            
            linkData = urlparse(link)
            if(linkData.query!=''):
                
                pageId=linkData.query.split("&id=")[1]
                
            elif(linkData.path!=''):

                pageId = linkData.path.split("/")[1]
            
            _,followers = parsePage(pageId) 
            if(followers == -1): #unable to parse the page
                missingDataPages+=1
            else:
                numOfPages+=1
                visibility += followers
            
            sleepTime=randint(25,36)
            print("Sleeping {} seconds to avoid getting blocked :(".format(sleepTime))
            sleep(sleepTime)
            
        #if we can't get info about a page we replace its number of followers with the mean of the ones we got
        if(numOfPages == 0):
            numOfPages=1
        missingFollowers = (visibility//numOfPages)*missingDataPages
        visibility += missingFollowers
        print("TOTAL VISIBILITY: {}".format(visibility))
        return visibility
    else:
        return row['visibility']
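
The mean replacement used for unreachable pages is simple arithmetic and can be checked by hand. In this sketch `impute_missing_followers` is a hypothetical helper that mirrors the integer division used in fix_visibility.

```python
def impute_missing_followers(known_followers, missing_pages):
    """Total followers, counting each unreachable page as the floor-mean of the known ones."""
    total = sum(known_followers)
    pages = len(known_followers) or 1  # avoid dividing by zero when nothing was fetched
    return total + (total // pages) * missing_pages


# Two reachable pages (100 and 300 followers) and one unreachable page:
# mean = 400 // 2 = 200, so the total becomes 400 + 200 = 600.
print(impute_missing_followers([100, 300], 1))
print(impute_missing_followers([], 2))
```

When no page could be fetched at all, the total stays at zero, so such rows are still recognizable as missing.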

In [None]:
if __name__ == '__main__':

    output_file = "/media/mfattoru/Backup Data/altmetric_Dataset/extracted_dataset/altmetrics_dataset_facebook_fixed.csv"

    input_file = "/media/mfattoru/Backup Data/altmetric_Dataset/extracted_dataset/altmetrics_dataset_facebook.csv"
    
    df = pd.read_csv(input_file)
    df['visibility'] = df.apply(fix_visibility,axis=1)
    df.to_csv(output_file,index=False)

# Restore wrong count of shares

In [None]:
from urllib.parse import urlparse
import pandas as pd
import numpy as np
import requests
import csv
import sys
from time import sleep
from pathlib import Path
import os
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import datetime
from random import randint
from numpy import nan
HOUR = 900
startSleep = 14
endSleep = 18

class bcolors:
    HEADER = '\033[95m'
    OKBLUE = '\033[94m'
    OKGREEN = '\033[92m'
    WARNING = '\033[93m'
    FAIL = '\033[91m'
    ENDC = '\033[0m'
    BOLD = '\033[1m'
    UNDERLINE = '\033[4m'

In [None]:
import json

def queryGraphApi(url):
    responseCode = 0
    tries = 1
    while(responseCode !=200 ):
        print("Try fetch number {}".format(tries))
        response = requests.get(url)
        responseCode = response.status_code
        tries+=1
        sleepTime=20
        #the X-App-Usage header holds a JSON object, so we parse it instead of calling eval on remote data
        callCount=json.loads(response.headers['X-App-Usage'])['call_count']

        if response.status_code == 400:  #the page went private
            response=400
            break
            #or should we return directly without waiting? it depends on whether the error counts as a query or not
        if response.status_code != 200:

            print( bcolors.FAIL + 'ERROR' + bcolors.ENDC + ': API response: ' + bcolors.FAIL+ '{}'.format(response.status_code) + bcolors.ENDC)

            print( bcolors.WARNING + "WARNING" + bcolors.ENDC+": You used already {}/100 calls percentage!".format(callCount))
            if(callCount > 98):
                sleepTime = 2*endSleep*(callCount-95)
            else:
                sleepTime = randint(2*startSleep,2*endSleep)

            print( bcolors.OKBLUE + "INFO" + bcolors.ENDC + ": Sleeping {} seconds before retrying...".format(sleepTime))
            sleep(sleepTime)
    print( bcolors.OKBLUE + "INFO" + bcolors.ENDC + ": You used {}/100 calls percentage!".format(callCount))
    if(callCount > 98):
        sleepTime = 2*endSleep*(callCount-95)
        print( bcolors.WARNING + "WARNING" + bcolors.ENDC + ": Sleeping {} seconds to lower the limit...".format(sleepTime))
    else:
        sleepTime = randint(startSleep,endSleep)
        print( bcolors.OKBLUE + "INFO" + bcolors.ENDC + ": Sleeping {} seconds to keep the limit...".format(sleepTime))
    sleep(sleepTime)
    return response


In [None]:
def cut_files(filename,numOfLines,fd):
    print( bcolors.WARNING + '\nYou pressed Ctrl+C!, I\'m deleting lines from the csv, be patient' + bcolors.ENDC)
    fd.close()
    counter=0
    with open(filename,'r+') as csvfile:
        with open(filename+".bkp", 'w') as bkpfile:
            with open(filename+".tmp", 'w') as tmpfile:
                bkpWriter = csv.writer(bkpfile)
                csvWriter = csv.writer(tmpfile)
                reader = csv.reader(csvfile)
                for row in reader:
                    bkpWriter.writerow(row)
                    if(counter==0 or counter > numOfLines):
                        csvWriter.writerow(row)
                    counter+=1
    os.remove(filename)
    os.rename(filename+".tmp",filename)
    print( bcolors.OKGREEN + "I'm done, deleted {} rows, enjoy!".format(numOfLines) + bcolors.ENDC)
    sys.exit(0)

In [None]:
def getFollowers(row):
    if 'page_likes' in row and 'page_followers' in row:

        if type(row['page_followers']) is int:  #if it's an int, if not it's an nan
            if row['page_followers'] == 0 :
                raise Exception('Empty Visibility')
            else:
                print( bcolors.OKGREEN + "OK" + bcolors.ENDC + ": Found {} likes and {} followers!".format(row['page_likes'],row['page_followers']))
                return row['page_likes'],row['page_followers']
        else:
            return row['page_likes'],row['page_followers']
    else:
        return np.nan,np.nan

In [None]:
if __name__ == '__main__':

    input_file = "/media/mfattoru/Backup Data/altmetric_Dataset/extracted_dataset/Split Dataset/Cleaned/cleaned_2.csv"
    output_file = "/media/mfattoru/Backup Data/altmetric_Dataset/extracted_dataset/Split Dataset/Cleaned/clean_fix_share/cleaned_2_shares.csv"
    access_token = "" #private
    graphurl = "https://graph.facebook.com/v2.2/"
    fields = "?fields=updated_time,message,name,caption,description,shares,reactions.type(LIKE).limit(0).summary(total_count).as(like)%2Creactions.type(LOVE).limit(0).summary(total_count).as(love)%2Creactions.type(WOW).limit(0).summary(total_count).as(wow)%2Creactions.type(HAHA).limit(0).summary(total_count).as(haha)%2Creactions.type(SAD).limit(0).summary(total_count).as(sad)%2Creactions.type(ANGRY).limit(0).summary(total_count).as(angry)"
    parameters = "&access_token={}".format(access_token)
    parsedLines = 0
    lastEqualRows = 0

    openOperation = 'w'

    try:
        os.stat(output_file)
        openOperation = 'a'
    except:
        openOperation = 'w'

    with open(output_file, openOperation) as fd:
        csvwriter = csv.writer(fd)

        header = ["altmetric_id", "title", 'subjects',"abstract",'pubdate' ,"fb_wall_count",'scopus_subjects','publisher_subjects',"fb_wall_urls"]
        emotions = ['like','love','wow','haha','sad','angry']
        if openOperation == 'w':
            print( bcolors.OKBLUE + "INFO" + bcolors.ENDC + ": File didn't exist before, so writing down the header")
            csvwriter.writerow(header+["shares","visibility"]+["total_" + s for s in emotions])

        df =  pd.read_csv(input_file)
        row_count = len(df.index)

        try:
            for i, row in df.iterrows():
                linkEmotionList=[]
                current_row=[]
                totalEmotions={}
                visibility = {}
                shares = 0
                numOfPages = 0
                missingDataPages = 0
                #this line is to recheck the number of shares
                if 'shares' in row:
                    old_shares = row['shares']
                else:
                    old_shares = -1

                numOfLinks = len(eval(row['fb_wall_urls']))
                if numOfLinks <= 1:
                    parsedLines += 1
                    print( bcolors.OKGREEN + "GOOD" + bcolors.ENDC+": Found only {} link, so the data is correct".format(numOfLinks))
                    print( bcolors.OKGREEN + "[ {} ] Copied line {}/{}".format(datetime.datetime.now(),parsedLines,row_count) + bcolors.ENDC)
                    csvwriter.writerow(row)
                else:
                    for link in eval(row['fb_wall_urls']):

                        if type(link) is not str:  #it's a dictionary
                            linkDict = link
                            #hack as i'm lazy
                            link = link['link']
                        else:  #it's a string
                            linkDict = eval(link)

                        print( bcolors.OKBLUE + "INFO" + bcolors.ENDC + ": Parsing and requesting info about: "+link )
                        dataDict = {'link':link}

                        linkData = urlparse(link)
                        if(linkData.query!=''):
                            story=linkData.query.split("&")[0]
                            storyId=story.split("=")[1]

                            pageId=linkData.query.split("&id=")[1]
                        elif(linkData.path!=''):
                            storyId = linkData.path.split("/")[3]
                            pageId = linkData.path.split("/")[1]
                        node = pageId + "_" + storyId
                        base_url = graphurl + node + fields + parameters

                        fb_data = queryGraphApi(base_url)
                        if(fb_data == 400):
                            print( bcolors.FAIL + "ERROR" + bcolors.ENDC + ": The page went private!!")
                            continue


                        if 'shares' in fb_data.json():
                            shares += fb_data.json()['shares']['count']

                        for emotion in emotions:
                            if emotion in fb_data.json():

                                emotionValue = fb_data.json()[emotion]['summary']['total_count']
                                dataDict[emotion] = emotionValue

                                if emotion in totalEmotions:
                                    totalEmotions[emotion]+=emotionValue
                                else:
                                    totalEmotions[emotion]=emotionValue
                            else:
                                dataDict[emotion]=np.nan

                        try:
                            likes,followers = getFollowers(linkDict)
                            # likes,followers = parsePage(pageId)
                            numOfPages+=1
                            if pageId in visibility:
                                visibility[pageId]+=followers
                            else:
                                visibility[pageId]=followers
                        except Exception:  #a bare except would also swallow KeyboardInterrupt
                            print( bcolors.FAIL + "ERROR: Unable to get likes and followers!" + bcolors.ENDC )
                            missingDataPages+=1
                            likes = np.nan
                            followers = np.nan

                        dataDict['page_likes'] = likes
                        dataDict['page_followers'] = followers

                        if 'message' in fb_data.json():
                            dataDict['message'] = fb_data.json()['message']
                            dataDict['has_text'] = 1
                        else:
                            dataDict['message'] = np.nan
                            dataDict['has_text'] = 0

                        linkEmotionList.append(dataDict)

                    if(len(linkEmotionList) != 0):
                        for column in header[:-1]: #we don't save the url, as we are going to overwrite it with the list of dictionary
                            if column in row:
                                current_row.append(row[column])
                            else:
                                current_row.append(np.nan)

                        #total followers over all the pages we were able to read
                        total_visibility = sum(visibility.values())

                        if(numOfPages == 0):
                            numOfPages=1
                        #estimate the followers of the pages we could not read as the average per successfully read post
                        missingFollowers = (total_visibility//numOfPages)*missingDataPages
                        if(missingFollowers > 0):
                            print( bcolors.WARNING + "WARNING" + bcolors.ENDC+": Adding {} followers as {} pages were missing!".format(missingFollowers,missingDataPages))
                            total_visibility += missingFollowers

                        current_row.append(linkEmotionList)
                        current_row.append(shares)
                        current_row.append(total_visibility)
                        
                        if shares != old_shares:
                            lastEqualRows = 0
                            print(bcolors.WARNING + "WARNING" + bcolors.ENDC+": Old shares were {}, new shares are: {}".format(old_shares,shares))
                        else:
                            lastEqualRows+=1
                            print(bcolors.OKGREEN + "GOOD" + bcolors.ENDC+":Last {} had Same shares {} == {}".format(lastEqualRows,old_shares,shares))


                        for value in totalEmotions.values():
                            current_row.append(value)
                        csvwriter.writerow(current_row)
                        parsedLines+=1
                        print( bcolors.OKGREEN + "[ {} ] parsed line {}/{}".format(datetime.datetime.now(),parsedLines,row_count) + bcolors.ENDC)
                    else:
                        print( bcolors.WARNING + "[ {} ] skipped line {}/{}".format(datetime.datetime.now(),parsedLines,row_count) + bcolors.ENDC)
                    # sleep(18)
        except KeyboardInterrupt:
            cut_files(input_file,parsedLines,fd)
        # except HTTPSConnectionPool:
            #should we wait for the connection to be restored and then continue?
        except Exception as e:
            print ( bcolors.FAIL + "ERROR" + bcolors.ENDC + ": Failed randomly: "+str(e))
            cut_files(input_file,parsedLines,fd)
            os.system('spd-say "your program failed randomly"')
    print( bcolors.OKGREEN + "[ {} ] Completed the task".format(datetime.datetime.now()) + bcolors.ENDC)
    cut_files(input_file,parsedLines,fd)
    os.system('spd-say "your program completed the task"')
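
The page and story ids extracted inside the loop above can be isolated in a small helper. This is only a sketch: `facebook_node_id` is a hypothetical name that does not appear in the notebook, the `examplepage` URL is made up for illustration, and `parse_qs` is swapped in for the manual string splitting so that the order of the query parameters no longer matters.

```python
from urllib.parse import urlparse, parse_qs

def facebook_node_id(link):
    """Return '<pageId>_<storyId>' for a Facebook post URL, or None if it cannot be derived."""
    parsed = urlparse(link)
    query = parse_qs(parsed.query)
    # permalink form: /permalink.php?story_fbid=<storyId>&id=<pageId>
    if 'story_fbid' in query and 'id' in query:
        return query['id'][0] + "_" + query['story_fbid'][0]
    # path form: /<pageId>/<segment>/<storyId>
    parts = parsed.path.split("/")
    if len(parts) > 3 and parts[1] and parts[3]:
        return parts[1] + "_" + parts[3]
    # nothing usable in the URL
    return None

print(facebook_node_id('https://www.facebook.com/permalink.php?story_fbid=947255595408598&id=133050910162408'))
print(facebook_node_id('https://www.facebook.com/examplepage/posts/12345'))
```

Returning `None` instead of raising lets the caller decide whether to skip the link or to count it as a page with missing data.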


# Process Dataset with Natural Language Processing

In [12]:
import pandas as pd
import numpy as np
import re
import nltk
import matplotlib.pyplot as plt
import math
from numpy import nan

- The function __normalize_document__ uses nltk to strip stop words, links and other non-readable characters from the text, and returns a string containing only the filtered tokens.

- The function __dict_to_string__ converts a dictionary to its string representation. We do this so that we can save it as a string inside our csv; reading from the csv then stays compatible with all the other code we wrote, where we read the string and use eval to turn it back into a dictionary.

- The function __normalize_fb_posts__ has the same effect as normalize_document, but it is used to normalize the text written by the pages that shared the article, since in that case the data is contained in a list of dictionaries.

In [6]:
wpt = nltk.WordPunctTokenizer()
# nltk.download()
stop_words = nltk.corpus.stopwords.words('english')
english_words = set(nltk.corpus.words.words()) #list of all english words

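def normalize_document(doc):
    if isinstance(doc, str):
        # remove special characters (flags must be passed by keyword:
        # the 4th positional argument of re.sub is count, not flags)
        doc = re.sub(r'[^a-zA-Z\s]', '', doc, flags=re.I|re.A)
        # lower case and strip leading/trailing whitespace
        doc = doc.lower()
        doc = doc.strip()
        # tokenize document
        tokens = wpt.tokenize(doc)
        # filter stopwords out of document
        filtered_tokens = [token for token in tokens if token not in stop_words]
        # optionally also drop tokens that are not in the nltk english word list
        # (disabled here: the df.head() output shown later retains terms such as
        # 'noncarotid' and 'hiv' that this filter would most likely remove)
        # filtered_tokens = [token for token in filtered_tokens if token in english_words]
        # re-create document from filtered tokens
        doc = ' '.join(filtered_tokens)

        return doc
    else:
        # non-string values (e.g. nan) are returned unchanged
        return doc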

def dict_to_string(dictionary):
    return '{'+', '.join("'{!s}': {!r}".format(key,val) for (key,val) in dictionary.items())+'}'

def normalize_fb_posts(postsList):

    posts = eval(postsList)

    for post in posts:
        if type(post['message']) is str:
            message = normalize_document(post['message'])
            post['message'] = message
            
    return '[' + ', '.join(dict_to_string(x) for x in posts) + ']'
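
The round trip between `dict_to_string` and `eval` can be illustrated with a short sketch. The `post` dictionary and its URL are invented for illustration, and `nan` is defined here as a plain Python float standing in for `numpy.nan`, so that the example has no external dependency. It also shows why `ast.literal_eval` is not a drop-in replacement for `eval` in this notebook: missing values are written as the bare name `nan`, which is not a literal.

```python
import ast
import math

nan = float('nan')  # stands in for numpy.nan, which is also a plain Python float

def dict_to_string(dictionary):
    # same formatting as the notebook function above
    return '{' + ', '.join("'{!s}': {!r}".format(key, val) for (key, val) in dictionary.items()) + '}'

# hypothetical post with a missing message
post = {'link': 'https://www.facebook.com/examplepage/posts/12345', 'like': 3, 'message': nan}
text = dict_to_string(post)

# eval can resolve the bare name `nan` because we provide it in the namespace
restored = eval(text, {'nan': nan})

# ast.literal_eval only accepts literals, so the bare name `nan` is rejected
try:
    ast.literal_eval(text)
    literal_eval_ok = True
except ValueError:
    literal_eval_ok = False

print(text)
print(math.isnan(restored['message']), literal_eval_ok)
```

Since `eval` executes arbitrary expressions, this approach is only reasonable for strings that the notebook itself produced.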

We read the dataset and apply the text normalization to the title of the article, the abstract and the messages written by the pages that shared the post.
At the end we save the results in a new csv file.

In [7]:
df = pd.read_csv('/media/mfattoru/Backup Data/altmetric_Dataset/extracted_dataset/altmetrics_dataset_facebook.csv')
df['title'] = df['title'].apply(lambda x: normalize_document(x))
df['abstract'] = df['abstract'].apply(lambda x: normalize_document(x))
df['fb_wall_urls'] = df['fb_wall_urls'].apply(lambda x: normalize_fb_posts(x))
df.to_csv('/media/mfattoru/Backup Data/altmetric_Dataset/extracted_dataset/cleaned_dataset.csv',index=False)

For the sentiment analysis, we used the Python library TextBlob.

This library offers a ready-to-use PatternAnalyzer model for sentiment analysis.
The model returns a float value between -1 and 1, which represents the polarity of the analyzed text.

We then reduce this value to its sign, so that the algorithm returns just the integers 1, 0 and -1.

For the Facebook post analysis, the resulting sentiment is the sum of the polarities of all the posts that shared the article, reduced in the same way to the integers 1, 0 and -1.

In [10]:
from textblob import TextBlob
import re

def clean_text(text):
    '''
    Utility function to clean a text by removing mentions,
    links and special characters using regex.
    '''
    if type(text) is str:
        return ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", text).split())
    else:
        return text

def analize_sentiment(text):
    '''
    Utility function to classify the polarity of a text
    using textblob. Returns 1, 0 or -1 (nan for non-string input).
    '''
    if type(text) is str:
        analysis = TextBlob(clean_text(text))
        if analysis.sentiment.polarity > 0:
            return 1
        elif analysis.sentiment.polarity == 0:
            return 0
        else:
            return -1
    else:
        return np.nan

def fb_posts_analysis(postList):
#     print(postList)
    posts = eval(postList)
    res = 0
    hasMessages = False
    for post in posts:
#         print(post)
        if type(post['message']) is str:
            hasMessages = True
            #the messages were already normalized when cleaned_dataset.csv was created
            analysis = TextBlob(clean_text(post['message']))
            res += analysis.sentiment.polarity
    
    if hasMessages:
        if res > 0:
            return 1
        elif res == 0:
            return 0
        else:
            return -1
    else:
        return np.nan
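
The reduction from a polarity to the labels 1, 0 and -1 can be looked at in isolation, without TextBlob. The sketch below uses hypothetical helper names (`polarity_to_label`, `aggregate_label`) and made-up polarity values; it only mirrors the comparison logic of the two functions above.

```python
import math

def polarity_to_label(polarity):
    """Map a polarity in [-1, 1] to 1, 0 or -1 by keeping only its sign."""
    if polarity > 0:
        return 1
    elif polarity == 0:
        return 0
    return -1

def aggregate_label(polarities):
    """Sum the polarities of several posts and label the total; nan when there are none."""
    if not polarities:
        return math.nan
    return polarity_to_label(sum(polarities))

print(polarity_to_label(0.35))              # a single positive text
print(aggregate_label([0.5, -0.2, -0.1]))   # positive posts outweigh the negative ones
print(aggregate_label([0.2, -0.6]))         # negative posts outweigh the positive ones
```

Because the total is compared with 0 exactly, floating-point rounding can tip a sum that is mathematically zero to either side (for example `0.1 + 0.2 - 0.3` is slightly above zero), so a label of 0 for several posts is mostly obtained when every post has polarity 0.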


We apply the sentiment analysis to the title, abstract and shared posts, and save the values as new features called __title_sentiment__, __abstract_sentiment__, and __facebook_sentiment__

In [11]:
df = pd.read_csv('/media/mfattoru/Backup Data/altmetric_Dataset/extracted_dataset/cleaned_dataset.csv')
df['title_sentiment'] = df['title'].apply(lambda x: analize_sentiment(x))
df['abstract_sentiment'] = df['abstract'].apply(lambda x: analize_sentiment(x))
df['facebook_sentiment'] = df['fb_wall_urls'].apply(lambda x: fb_posts_analysis(x))

df.head()

Unnamed: 0,altmetric_id,title,subjects,abstract,pubdate,fb_wall_count,scopus_subjects,publisher_subjects,fb_wall_urls,shares,visibility,total_like,total_love,total_wow,total_haha,total_sad,total_angry,title_sentiment,abstract_sentiment,facebook_sentiment
0,16937763.0,risk ischemic stroke transient ischemic attack increased days noncarotid noncardiac surgery,['brain'],risk stroke cardiac carotid surgery well established contrast stroke risk association noncardiac noncarotid surgery time course insufficiently known investigated prevalence recent planned surgery ...,2017-02-28T00:00:00+00:00,1,Health Sciences,Clinical Sciences,"[{'link': 'https://www.facebook.com/permalink.php?story_fbid=947255595408598&id=133050910162408', 'like': 3, 'love': 0, 'wow': 0, 'haha': 0, 'sad': 0, 'angry': 0, 'page_likes': 2096175, 'page_foll...",0,2096651,3,0,0,0,0,0,0.0,1.0,1.0
1,31057208.0,understanding targeting uptake hiv testing among gay bisexual men attending sexual health clinics,"['acquiredimmunodeficiencysyndrome', 'behavioralsciences']",assessed trends hiv testing outcomes period clinicbased initiatives introduced increase hiv testing among gay bisexual men gbm attending sexual health clinics shcs new south wales nsw cohort hivne...,2017-12-19T00:00:00+00:00,1,Health Sciences,Health Psychology,"[{'link': 'https://www.facebook.com/permalink.php?story_fbid=1794357743949490&id=174782242573723', 'like': 3, 'love': 0, 'wow': 0, 'haha': 0, 'sad': 0, 'angry': 0, 'page_likes': 1624, 'page_follow...",0,1662,3,0,0,0,0,0,1.0,1.0,1.0
2,27637930.0,still lonely social adjustment youth without social anxiety disorder following cognitive behavioral therapy,['psychiatry'],social experiences integral part normative development youth social functioning difficulties related poor outcomes youth anxiety disorders particularly social anxiety disorder experience difficult...,2017-10-01T00:00:00+00:00,1,Health Sciences,Clinical Sciences,"[{'link': 'https://www.facebook.com/permalink.php?story_fbid=1662682910468753&id=450103581726698', 'like': 0, 'love': 0, 'wow': 0, 'haha': 0, 'sad': 0, 'angry': 0, 'page_likes': 1717, 'page_follow...",1,1770,0,0,0,0,0,0,-1.0,1.0,-1.0
3,28052326.0,usefulness electrical auditory brainstem responses assess functionality cochlear nerve using intracochlear test electrode,"['otolaryngology', 'neurology']",use intracochlear test electrode assess integrity functionality auditory nerve cochlear implant ci recipients compare electrical auditory brainstem responses eabr via test electrode eabr responses...,2017-10-01T00:00:00+00:00,2,Health Sciences,Zoology,"[{'link': 'https://www.facebook.com/permalink.php?story_fbid=992368990906348&id=376404182502835', 'like': 0, 'love': 0, 'wow': 0, 'haha': 0, 'sad': 0, 'angry': 0, 'page_likes': 72, 'page_followers...",1,142,0,0,0,0,0,0,0.0,1.0,0.0
4,27585764.0,impact pediatric obesity acute asthma exacerbation japan,"['pediatrics', 'allergyandimmunology']",asthma obesity common health problems children study investigated impact obesity children hospitalized acute asthma exacerbation obtained hospital discharge records inpatients aged years diagnosis...,2017-10-18T00:00:00+00:00,3,Health Sciences,Immunology,"[{'link': 'https://www.facebook.com/permalink.php?story_fbid=939393522865567&id=143587395779521', 'like': 5, 'love': 0, 'wow': 0, 'haha': 0, 'sad': 0, 'angry': 0, 'page_likes': 100253, 'page_follo...",8,206236,22,1,0,0,0,0,1.0,1.0,1.0


We now store the values in a new csv file

In [12]:
df.to_csv('/media/mfattoru/Backup Data/altmetric_Dataset/extracted_dataset/nlp_dataset.csv',index=False)