## Data Preparation

We filter and prepare the data downloaded from http://ris3mcat.gencat.cat.

We perform:
- attribute selection
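- filtering based on detected language (the loop below keeps abstracts detected as neither English nor Spanish, and the result is saved to the `-cleaned-catalan` file)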
- duplicate removal
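
The three steps can be sketched on a small in-memory table. This is a minimal illustration, not the notebook's actual pipeline: the rows are made up, the `extra` column is hypothetical, and the `lang` column is a hypothetical stand-in for the output of a language detector (running a real detector needs the spaCy models). The filter mirrors the condition used in the loop of this notebook, which keeps abstracts detected as neither `en` nor `es`.

```python
import pandas as pd

# Made-up rows: one row per (project, goal) pair, as in the raw export.
# 'lang' and 'extra' are hypothetical columns used only for this sketch.
raw = pd.DataFrame({
    "projectId": ["p1", "p1", "p1", "p2", "p3"],
    "projectTitle": ["Title A", "Title A", "Title A", "Title B", "Title C"],
    "projectAbstract": ["Abstract A", "Abstract A", "Abstract A", "Abstract B", "Abstract C"],
    "sdgName": ["SDG 3", "SDG 13", "SDG 3", "SDG 7", None],
    "lang": ["ca", "ca", "ca", "en", "ca"],
    "extra": [1, 2, 3, 4, 5],
})

# 1. Attribute selection: keep only the columns needed downstream.
selected = raw[["projectId", "projectTitle", "projectAbstract", "sdgName", "lang"]]

# 2. Language filter: drop rows whose detected language is English or Spanish.
kept = selected[~selected["lang"].isin(["en", "es"])]

# 3. Duplicate removal, then one row per project with its goals joined.
deduped = kept.drop_duplicates()
per_project = (
    deduped.groupby("projectId", sort=False)
    .agg(
        projectTitle=("projectTitle", "first"),
        projectAbstract=("projectAbstract", "first"),
        sdgName=("sdgName", lambda s: ",".join(s.dropna())),
    )
    .reset_index()
)

print(per_project.to_string(index=False))
```

Grouping by `projectId` avoids tracking row indices by hand, and a project without any goal simply ends up with an empty string.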

### ATTENTION

SDG 8 (decent work and economic growth), SDG 9 (industry, innovation and infrastructure) and SDG 17 (partnerships for the goals) were already excluded from the source dataset. They would have no effect on data filtering, because these three goals are inherent in all the collaborative R&D&I projects that make up the Platform.

In [None]:
import pandas
import numpy as np
import spacy, en_core_web_sm, en_core_web_lg
from spacy_langdetect import LanguageDetector

In [None]:
projects_df = pandas.read_csv('data/ris3-mcat-projects.csv')

projects_df = projects_df[['projectId', 'projectTitle', 'projectAbstract', 'sdgName']]
projects_df = projects_df.sort_values(by=['projectId'], ignore_index=True)
projects_df.head(20)

In [None]:
nlp = en_core_web_lg.load()
# spaCy 2.x call signature (component object plus name)
nlp.add_pipe(LanguageDetector(), name='language_detector', last=True)

new_matrix = []

# One output row per project. Grouping by projectId replaces the manual
# previous_pid/start_index bookkeeping, which labelled the first project
# 'xxxxx', checked the language of the *next* project's abstract, and
# never emitted the last project.
for pid, group in projects_df.groupby('projectId', sort=False):
    title = group['projectTitle'].iloc[0]
    abstract = group['projectAbstract'].iloc[0]

    # Skip projects without an abstract (NaN is a float, not a str)
    if not isinstance(abstract, str):
        continue

    doc = nlp(abstract)
    # Keep only abstracts detected as neither English nor Spanish
    if doc._.language['language'] not in ('en', 'es'):
        print(abstract)
        print()
        goals = []
        for g in group['sdgName']:
            if isinstance(g, str):
                # Second token of the label without its last character
                goals.append(g.split()[1][:-1])

        # Deduplicate; sorting keeps the output reproducible
        goals = ','.join(sorted(set(goals)))
        new_matrix.append([pid, title, abstract, goals])

new_df = pandas.DataFrame(data=new_matrix, columns=projects_df.columns)
new_df.head(50)
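
The expression `g.split()[1][:-1]` in the cell above depends on the exact shape of the `sdgName` labels. As a more defensive alternative, the goal number could be extracted with a regular expression. This is only a sketch: the label shape `SDG <number>: <name>` is an assumption about the data, and `sdg_number` and `join_goals` are hypothetical helper names, not part of the notebook.

```python
import re

# Assumed label shape: "SDG <number>: <name>". The real sdgName format may differ.
_SDG_NUMBER = re.compile(r"\bSDG\s*(\d{1,2})\b", re.IGNORECASE)


def sdg_number(label):
    """Return the goal number as a string, or None if the label does not match."""
    if not isinstance(label, str):
        return None  # e.g. NaN for a project without an assigned goal
    match = _SDG_NUMBER.search(label)
    return match.group(1) if match else None


def join_goals(labels):
    """Deduplicate goal numbers and join them in numeric order."""
    numbers = {n for n in map(sdg_number, labels) if n is not None}
    return ",".join(sorted(numbers, key=int))


labels = [
    "SDG 3: Good health and well-being",
    "SDG 13: Climate action",
    "SDG 3: Good health and well-being",
    float("nan"),
]
print(join_goals(labels))  # prints 3,13
```

Sorting with `key=int` puts `3` before `13`, which a plain string sort would not do.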

In [None]:
new_df.to_csv('data/ris3-mcat-projects-cleaned-catalan.csv', index=False, sep='\t')

In [None]:
# Number of projects kept after filtering
print(len(new_df))

In [None]:
goals_df = pandas.read_excel('data/un-goals.xlsx')
goals_df.head(20)