# NLP: gathering quotes of people in STEM in Spanish written press articles

1. [Rationale](#rationale)
2. [Load dataset and libraries](#load-reqs)
3. [Collect names of STEM people quoted](#names-collection)
4. [Assign gender to collected names](#gender-assignment)
5. [Curate dataset of STEM people names](#curate-names)
6. [Compute # gender mentions](#gender-mentions)
7. [Evaluation](#evaluation)
8. [Export dataset](#export)

##  1. Rationale <a class="anchor" id="rationale"></a>

For detecting human names, I've built a pipeline to analyze article text and collect names of STEM-related people, assign gender to collected names and compute the number of men and women that are mentioned in the entire article. The process follows these steps:
<ol>
    <li>Build a classifier of press articles mentioning STEM people. The classifier identifes those articles mentioning STEM people and collect both the names of STEM people mentioned and the sentences where they appear (to ease the evaluation of the classifier). The classifier uses a rule based approach (when certain keywords are found in the article text) to identify articles mentioning STEM people together with a NLP Named Entity Recognition (NER) process as explained <a href="https://unbiased-coder.com/extract-names-python-spacy/">here</a> to collect human names (see below)</li>
    <li>Assign gender to names of STEM people collected, using a library called Gender-Guesser</li>
    <li>Refinement: after analyzing the preliminar results, I've curated the dataset of STEM people names (basically, to remove non-STEM people such as historical people - i.e. kings, writers, - and politicians) that are unwantedly collected during the step 1 above</li>
    <li>Compute the number of STEM men and women mentioned in each article</li>
    <li>Validate the classifier by inspecting 100 articles (50 mentioning STEM people, 50 non-mentioning STEM people) to obtain the accuracy, precision and recall of the classifier</li>
    <li>Analyze accuracy of gender assignment by manually inspecting 150 names</li>
</ol>

Here, I consider STEM in a broad sense to include academic authorities and experts in any technical field, science, engineering, architecture or even humanities if it has any overlapping with science, technology, innovation or research. For example, economic academics are also considered here as STEM, as long as they are quoted as university professors or prestigious researchers/authorities in the field. The aim is to analyze gender presence in academic authorities quoted in Spanish press, including people from any field of knowledge involving research and innovation.

The accuracy of the classifier is analyzed by manually inspecting and labeling 100 articles. Only if the article quotes an academic or expert, then it is considered as STEM. If the article covers a science or technology topic but it does not quote any expert, then it is not considered as STEM (the aim is to measure how good is the pipeline to catch quotes by academic authorities and STEM people).

## 2. Load dataset and libraries <a class="anchor" id="load-reqs"></a>

In [1]:
import pandas as pd
from nltk.tokenize import sent_tokenize
import nltk
import spacy
import itertools
import re
import gender_guesser.detector as gender
import unicodedata
from sklearn.metrics import classification_report

Below we load the Spacy model in Spanish that will do the NER analysis to capture names. More info [here](https://spacy.io/models/es#es_core_news_lg)

In [2]:
spanish_nlp = spacy.load('es_core_news_lg')

The dataset comes from [SciWire NewsMine](http://sciride.org/news.html), a collection of 26 million front-page news from 172 news outlets in 11 countries collected from 2015 to 2020 at [The Internet Archive](https://archive.org/). From this dataset, I have used a subset of 4 major Spanish newspapers (ABC, El Pais, El Mundo, La Vanguardia) and taken a random sample of 50.000 articles across all days within the period between 2015 and 2020. As the result was not balanced for the 4 sources and the 6 years and the evolution of gender presence across years was an important question to investigate (and to avoid any potential source effect), the dataset was downsampled to have 1300 articles for each year and newspaper, yielding a final dataset of 26.000 articles covering the period from 2016 to 2020.

In [3]:
sample_26k_part1 = pd.read_csv('data/sample50k_abc_mun_pai_van_wrangled_downsampled_part1.csv', sep='‰', engine='python', index_col=0)
sample_26k_part2 = pd.read_csv('data/sample50k_abc_mun_pai_van_wrangled_downsampled_part2.csv', sep='‰', engine='python', index_col=0)
sample_26k_part3 = pd.read_csv('data/sample50k_abc_mun_pai_van_wrangled_downsampled_part3.csv', sep='‰', engine='python', index_col=0)
sample_26k = pd.concat([sample_26k_part1, sample_26k_part2, sample_26k_part3])
sample_26k['frontpage_date'] = pd.to_datetime(sample_26k['frontpage_date'])
sample_26k.shape

(26000, 7)

In [4]:
links = pd.read_csv('data/sample50k_links.csv', sep='‰', engine='python', index_col=0)

## 3. Collecting names of STEM people mentioned <a class="anchor" id="names-collection"></a>

I have attempted several approaches:
- First, I tried the simplest possible approach: tag an article as STEM related if its body text contains the word "scientist" (in Spanish, "cientific", to collect both men and women). Then, for those articles tagged as STEM related, perform a Named Entity Recognition (NER) analysis with Spacy on the entire article text and then take names of people appearing in the article. This approach, however, had 2 important shortcomings. First, it collected many articles the focus of which was largerly non-STEM (just mentioned the word "science" tangentially). Second, it collected many names that, because of the above, were not STEM people (i.e. politicians, historical people, artists and people related with criminal/forensic investigations)
- Then, I tried to refine the tagging of STEM-related articles by having a two step process: first, tag articles as potentially STEM related if at least one out of several keywords appear in the article text ("scientist", "mathematician", "engineer", "architect", etc). Second, keep the STEM tag if a number of "avoid" keywords appear (see below). Still, this approach collected an excess of names because the NER analysis was done on the entire article text.
- Finally, I kept the rule based approach to select STEM articles as above but restricted the NER-based collection of STEM people names to the very same sentences where they are mentioned. Presumably, this approach misses some people quotes, but the quotes that are captured do come from STEM people in a substantial proportion. This is the approach for which evaluation metrics are provided below.

To avoid collecting the name of the same person more than once, I have used 2 strategies:
- After collecting NERs for an article, the list of names is converted to a set to remove exact duplicated strings
- After collecting NERs for an article, check the distance between the strings (i.e. number of different characters) and when 2 strings have a distance of 2 or lower (these are considered mispellings), keep only one of those 2 strings
- After finishing NER collection for the entire dataset, check if any name string is contained in another collected name. If so, remove the longest string. This is explained below in section `5. Curate STEM people found`

In [5]:
def mentions_stem(articles, nlp, n=0):
    ''' Collects names of STEM people mentioned and stores them in a different dataframe
    
    INPUT
        articles (dataframe): contains articles body text
        nlp (spacy.lang.es.Spanish): loaded spacy model
        n (int): number of rows of articles dataframe to analyze
    
    OUTPUT:
        people_df (dataframe): contains article_id as index and a column called person
        sentences_df (dataframe): sentences containing 'cientific'
    '''
    
    people_df = pd.DataFrame()
    sentences_stem_df = pd.DataFrame()    
    stem_terms = ['científic', 'ingenier', 'matemátic', 'informátic', \
                  'investiga','arquitect', 'expert', 'profesor', \
                  'catedrátic', 'universidad']
    
    avoid_nonstem = ['pensador', 'filósof', 'filosofía', 'reportaje', 'en comunicación', 'periodista', 'litera', \
                    'episcopal', 'diócesis', 'activis', 'de inglés', 'oposiciones', 'Filología', 'Educación Física', \
                     'párroco', 'Corresponsal', 'casting']
    
    avoid_cultura = ['cantante', 'canción', 'actriz', 'actor', 'cine', 'presentador', 'baile', 'interpretación', 'concursante', \
                    'Bellas Artes', 'arte', 'artista', 'músic', 'escrit', 'novela', 'poeta', 'poema', 'tebeo', 'dibujante', 'cómic', \
                    'moda', 'diseñador', 'ballet', 'conservatorio', 'taller', 'dramaturgo']
    
    avoid_economy = ['hipoteca', 'consultor']
    
    avoid_sports = ['jugador', 'fútbol', 'campeón', 'piloto', 'gimnasia', 'torneo', 'olímpic', 'FIFA', 'UEFA']
    
    avoid_politics = ['polític', 'diputad', 'Congreso', 'Senado', 'ministr', 'ministerio', 'Ministerio', \
                      'PSOE', 'concejal', 'alcalde', \
                      'ayuntamiento', 'embajador', 'Diputación', 'senador', \
                      'comunista', 'consejero', 'conselleiro', 'secretario de Estado', 'Guerra Civil', 'Gobierno', \
                     'consejería', 'rey', 'lehendakari', 'portavoz', 'Embajada', 'candidat', 'procés']
    
    avoid_legal = ['polic', 'Policía', 'Guardia Civil', 'delito', 'extrad', 'fiscal', 'Fiscal', 'prisión', 'Audiencia', \
                   'procesal', 'juez', 'corrupción', 'judicial',  'presunt', 'supuest', 'juzgado', 'cárcel', 'preso', 'condenado', \
                  'criminal', 'testigo', 'Derecho', 'derecho', 'asesin', 'abogad', 'grupo armado', 'jurisdicción', 'arrest', \
                   'Tribunal', 'malversación', 'detective', 'jurídica', 'fuerzas de seguridad', 'soborno', 'subasta', \
                  'cadáver', 'Urdangarín', 'magistrad', 'magistratura']
    
    avoid_terms = avoid_nonstem + avoid_cultura + avoid_economy + avoid_sports + avoid_politics + avoid_legal 
    if n != 0:
        articles = articles.iloc[:n]
    for article_id,row in articles.iterrows():
        txt = articles[articles.index==article_id]['body_text'].values
        txt = ''.join(txt)
        found_sents = sent_tokenize(txt) # this nltk function looks faster than doing entire Spacy NLP on full text
        to_do_ner = False
        sentences = []
        for sentence in found_sents:
            # one sentence may have several stem_terms, avoid storing the same sentence several times
            if sentence not in sentences:
                for term in stem_terms:
                    if term in sentence:
                            store = True
                            for avoid_term in avoid_terms:
                                if avoid_term in sentence:
                                    store = False
                            if store == True:
                                sentences.append(sentence)
        # remove duplicate sentences, it happens sometimes
        sentences = list(set(sentences))
        # store sentences_with_stem_terms in a sentences_stem dataframe for evaluation          
        if len(sentences)>0:
            people = []
            for sentence in sentences:
                sentences_stem_df = pd.concat([sentences_stem_df, pd.DataFrame([[article_id, sentence]], columns=['article_id','sentence'])])
                # search human names in those sentences
                
                document = nlp(sentence)
                for named_entity in document.ents:
                    if named_entity.label_ == 'PER':
                        # do not store single words. Names almost always consist of 2+ words
                        # except for very famouse people (i.e. Tedros, WHO president) and should
                        # be fully named at least once in the article
                        if len(named_entity.text.split()) > 1:
                            entity_name = named_entity.text
                            # clean recovered text
                            entity_name = entity_name.strip(' .; ')
                            entity_name = ''.join(letter for letter in entity_name if letter not in ['\r','\n'])
                            entity_name = re.sub("^(sr|dra|prof|dr)(\.?)", '', entity_name, flags=re.IGNORECASE).strip()
                            people.append(entity_name)
            # remove duplicates
            people = list(set(people))
            # remove misspelings (edit_distance <=2 )
            pair_combinations = list(itertools.combinations(people,2))
            people_duplicates = []
            for name_a,name_b in pair_combinations:
                # only if the 2 names have 3 or more different characters, they will considered different names
                if nltk.edit_distance(name_a,name_b) < 3:
                    people_duplicates.append(name_a)
                    #people_duplicates = list(set(people_duplicates))
            people_rev = list(set(people) - set(people_duplicates))
            # save results in people dataframe
            for person in people_rev:
                people_df = pd.concat([people_df, pd.DataFrame([[article_id, person]], columns=['article_id','full_name'])])  

    # if there is data, set index to article_id
    if len(people_df) >0:
        people_df.set_index('article_id', inplace=True)
    if len(sentences_stem_df) >0:
        sentences_stem_df.set_index('article_id', inplace=True)
    return people_df, sentences_stem_df

## 4. Assigning gender to names <a class="anchor" id="gender-assignment"></a>

To assign gender to collected names, I use the [gender-guesser](https://pypi.org/project/gender-guesser/) library, which according to [this study](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7924484/) yields the best results among several options, with only 2.6% of misclassifications (compared to [nameapi](https://www.nameapi.org/index.php?id=215), genderize.io (the same used in [this paper](https://elifesciences.org/reviewed-preprints/84855) analyzing thousands of news pieces published at Nature) - see python wrappers [here](https://github.com/acceptable-security/gender.py) or [here](https://github.com/SteelPangolin/go-genderize) or [here](https://github.com/kalimu/genderizeR) - and [NamSor](https://namsor.app/) - see a python wrapper [here](https://github.com/namsor/namsor-python-tools-v2))

In [6]:
def strip_accents(s):
    ''' Removes accents from names to avoid problems when assigning gender with Spanish names
    '''
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                  if unicodedata.category(c) != 'Mn')

In [7]:
def assign_gender(names_df):
    ''' Creates a column in dataframe with genre detected by engine
    
    INPUT:
        names_df (dataframe): dataframe containing names in a column called 'full_name'
    OUTPUT:
        names_df (dataframe): includes a column called 'gender'
        
    '''
    # initialize gender-gesser
    g_guess = gender.Detector(case_sensitive=False)
    
    # take the first word of the name
    names_df['gender'] = names_df['full_name'].apply(lambda x: g_guess.get_gender(strip_accents(x.split()[0])))
    # remove mostly_ to collapse into men or women
    names_df['gender'] = names_df['gender'].str.replace('mostly_','')
    
    return names_df

To facilitate the iterative process, I've used a pipeline to collect names, assign gender and print a evaluation report including the quoted sentences ('STEM_sentences') and the names of the STEM people found ('STEM people').

In [8]:
def pipeline_mentions(articles, nlp_model, sample_size=0):
    ''' Collect STEM people names and sentences where mentions occur
    INPUT:
        articles (dataframe)
        nlp_model (spacy model object): spacy loaded model ('es_core_news_lg') used for Named Entity Recognition
        sample_size = number of articles to process. 0 = all articles in dataset
    OUTPUT:
        articles (dataframe): now has 2 additional columns with the number of STEM men and STEM women mentioned in each 
        article
        people_stem (dataframe): the names of people found in articles
        sentences_stem (dataframe): sentences where STEM keywords are found
    '''
    if sample_size != 0:
        articles = articles.sample(n=sample_size)
    people_stem, sentences_stem = mentions_stem(articles, nlp_model)
    people_stem = assign_gender(people_stem)
    return articles, people_stem, sentences_stem

In [9]:
articles, people_stem, sentences_stem = pipeline_mentions(sample_26k, spanish_nlp)

Let's explore gender assignment here. I will evaluate it below in the section "Evaluation"

In [10]:
people_stem.shape

(3612, 2)

In [11]:
people_stem['gender'].value_counts()

male       2240
female      752
unknown     548
andy         72
Name: gender, dtype: int64

In [12]:
people_stem['full_name'].nunique()

3360

In [13]:
people_stem.duplicated().sum()

252

### Correct gender assignment in most prominent cases

Let's correct a few of those tagged as 'unknown' or 'andy' (only those that appear in more than 1 article)

In [14]:
people_stem[people_stem['gender']=='andy'].reset_index().groupby('full_name').count().reset_index().sort_values(by='article_id', ascending=False)

Unnamed: 0,full_name,article_id,gender
51,Xi Jinping,4,4
29,Mies van der Rohe,4,4
40,Santos Juliá,2,2
41,Saray Durán-Sanchón,1,1
46,Tiantian Yuan,1,1
...,...,...,...
26,Li Yonghui,1,1
27,Lintao Zhang,1,1
28,Mao Zedong,1,1
30,Ningyu Liu,1,1


In [15]:
people_stem.loc[people_stem['full_name']=='Xi Jinping', 'gender'] = 'male'

In [16]:
people_stem.loc[people_stem['full_name']=='Mies van der Rohe', 'gender'] = 'male'

In [17]:
people_stem.loc[people_stem['full_name']=='Santos Juliá', 'gender'] = 'male'

In [18]:
people_stem.loc[people_stem['full_name']=='Gay de Liébana', 'gender'] = 'male'

In [19]:
people_stem[people_stem['gender']=='andy'].reset_index().groupby('full_name').count().reset_index().sort_values(by='article_id', ascending=False).head()

Unnamed: 0,full_name,article_id,gender
0,Bin Salman,1,1
46,Wei-Ping Andrew Lee,1,1
33,Pin ochet,1,1
34,Qing Li,1,1
35,Ri Sol Ju,1,1


Let's explore and correct a couple names, those appearing in several articles, whose genre could not be identified:

In [20]:
unknown = people_stem[people_stem['gender']=='unknown'].reset_index().groupby('full_name').count().reset_index().sort_values(by='article_id', ascending=False)
unknown[0:50]

Unnamed: 0,full_name,article_id,gender
397,Rovira i Virgili,4,4
392,Rocío Escalante,3,3
368,Quim Torra,3,3
394,Rolls Royce,2,2
64,Big Vang,2,2
230,Joaquín Arango,2,2
475,Zaha Hadid,2,2
234,Joaquín Maudos,2,2
418,Sherlock Holmes,2,2
29,Antropología Física,2,2


In [21]:
people_stem.loc[people_stem['full_name']=='Rocío Escalante', 'gender'] = 'female'

In [22]:
people_stem.loc[people_stem['full_name']=='Zaha Hadid', 'gender'] = 'female'

In [23]:
people_stem.loc[people_stem['full_name']=='Joaquín Arango', 'gender'] = 'male'

In [24]:
people_stem.loc[people_stem['full_name']=='Ieoh Ming Pei', 'gender'] = 'male'

In [25]:
people_stem.loc[people_stem['full_name']=='Joaquín Maudos', 'gender'] = 'male'

In [26]:
people_stem.loc[people_stem['full_name']=='Esteve Fernández', 'gender'] = 'male'

In [27]:
people_stem[people_stem['gender']=='unknown'].reset_index().groupby('full_name').count().reset_index().sort_values(by='article_id', ascending=False).head()

Unnamed: 0,full_name,article_id,gender
392,Rovira i Virgili,4,4
364,Quim Torra,3,3
507,Álvarez Conde,2,2
185,Hacienda Pública,2,2
102,Cristóbal Colón,2,2


## 5. Curate STEM people found <a class="anchor" id="curate-names"></a>

As seen above, some of the names collected do not correspond to STEM people but a few politicians and historical people (i.e. kings) are also collected. Let's explore it in more detail and curate the collected dataset.

In [30]:
def print_article(article_id, chars= 1000):
    ''' Prints a fragment of the article, together with its article id, title, authors and links
    
    INPUT:
        article_id (int)
        chars (int): number of characters to print
    '''

    title = articles.loc[articles.index== article_id]['title'].values[0]
    body = articles.loc[articles.index== article_id]['body_text'].values[0]
    print("TITLE: " + title)
    print()
    if chars != 0:
        print(body[:chars])
    else:
        print(body)
    print()
    print()
    print("STEM_PEOPLE:")
    print(people_stem[people_stem.index==article_id]['full_name'].tolist())
    print()
    print("STEM_SENTENCES:")
    print(sentences_stem[sentences_stem.index==article_id]['sentence'].tolist())    

In [32]:
print_article('d4647ca3bc0e0cba84f33cb970bf91e9')

TITLE: Ni el melón es indigesto por la noche ni el pollo tiene hormonas: la guía que desmonta bulos en alimentación

Ni el melón es indigesto por la noche ni el pollo tiene hormonas: la guía que desmonta bulos en alimentación Los expertos Beatriz Robles, Gemma del Caño y Pablo Ojeda aportan evidencia científica frente a las falsas creencias

Bulos en alimentación hay muchos y las frutas protagonizan la mayor parte de ellos. Que si comer plátano engorda , que si el melón o la sandía sientan mal por la noche ... «El plátano contiene mucha fibra, ésta es importante para mantener los hábitos intestinales regulares, desempeña un papel vital en la salud digestiva y aporta saciedad lo cual es muy importante en esa bajada de peso», explica el dietista Pablo Ojeda. Y toda la fruta se puede consumir a cualquier hora del día. « Ni el melón, ni la sandía son indigestos por la noche y tampoco la fruta fermenta después de las comidas», destaca este experto que, junto a la farmacéutica Gemma del Caño

### Clean names text

To clean and process names, let's create a dataframe with unique names in the dataset

In [33]:
people_unique = people_stem.reset_index().groupby('full_name').count().reset_index().sort_values(by='article_id', ascending=False)

#### Remove dots

In [34]:
people_unique[people_unique['full_name'].str.contains('\.')]

Unnamed: 0,full_name,article_id,gender
2429,N. Ursúa,1,1
2428,Mónica R. Goya“En mi opinión,1,1
2350,Miguel A. Martínez-González,1,1
2348,Miguel A. Hernán,1,1
2311,Maxim A. Suchkov,1,1
...,...,...,...
1209,Hasan G. López,1,1
1266,Husseini K. Manji,1,1
1113,George W. Bush,1,1
1174,Gustavo E. Romero,1,1


In [35]:
people_stem['full_name'].replace('\.',' ', regex=True, inplace=True)
people_stem['full_name'] = people_stem['full_name'].apply(lambda x:" ".join(x.split()))
people_unique = people_stem.reset_index().groupby('full_name').count().reset_index().sort_values(by='article_id', ascending=False)

In [36]:
people_unique[people_unique['full_name'].str.contains('\.')]

Unnamed: 0,full_name,article_id,gender


In [37]:
people_unique

Unnamed: 0,full_name,article_id,gender
749,Donald Trump,17,17
1099,Gay de Liébana,11,11
2611,Pedro Sánchez,7,7
766,EL MUNDO,7,7
480,Carlos III,6,6
...,...,...,...
1163,Guadalupe Sabio,1,1
1164,Guillaume Chomicki,1,1
1165,Guillermo Cisneros,1,1
1166,Guillermo Morenés,1,1


#### Remove single word names

In [38]:
people_unique['words'] = people_unique['full_name'].apply(lambda x: len(x.split()))
people_unique.loc[people_unique['words']==1]

Unnamed: 0,full_name,article_id,gender,words
2282,MatemáticasEn,1,1,1
0,0Atiyah,1,1,1
2497,ORGASMOA,1,1,1
1874,Kogan,1,1,1
1940,LigüikiTras,1,1,1
1810,Juninho“,1,1,1
2065,Madrid@marcosbd10,1,1,1
3074,TANEl,1,1,1
3134,TrumpAunque,1,1,1
3015,Somos5La,1,1,1


In [39]:
one_word = people_unique.loc[people_unique['words']==1]['full_name'].to_list()
people_stem = people_stem.loc[~people_stem['full_name'].isin(one_word)]
people_unique = people_stem.reset_index().groupby('full_name').count().reset_index()
people_unique['words'] = people_unique['full_name'].apply(lambda x: len(x.split()))

In [40]:
people_unique.sort_values(by='words')

Unnamed: 0,full_name,article_id,gender,words
0,11 35La,1,1,2
1707,José Reque,1,1,2
2821,Rodrigo Elías,1,1,2
2820,Rodrigo Echenique,1,1,2
2819,Rodrigo Duterte,1,1,2
...,...,...,...,...
144,Alfredo Valido de la Estación Biológica de Doñana,1,1,8
2606,Pesca i Alimentació de la Generalitat de Catal...,1,1,8
785,Eduardo Tizzano Jefe de Genética del Hospital ...,1,1,9
715,Decano Facultad Ciencias Jurídicas y Economica...,1,1,11


#### Explore names that contain too many words

As shown below, these names do contain real author names, even if they also include the position. Some non-real names do not have a detected gender, so they will not compute for the proportion. Thus, we leave this as it is.

In [41]:
people_unique.sort_values(by='words', ascending=False, inplace= True)
people_unique.loc[people_unique['words']>4]

Unnamed: 0,full_name,article_id,gender,words
3145,Vicente Arribas de Paz Jefe Servicio Urgencias...,1,1,13
715,Decano Facultad Ciencias Jurídicas y Economica...,1,1,11
785,Eduardo Tizzano Jefe de Genética del Hospital ...,1,1,9
2606,Pesca i Alimentació de la Generalitat de Catal...,1,1,8
144,Alfredo Valido de la Estación Biológica de Doñana,1,1,8
857,Enfermedades Infecciosas de la Escuela de Higiene,1,1,7
1118,Gilles Edan del Hospital Universitario de Rennes,1,1,7
1764,Juan Vilchez de la Fé de Valencia,1,1,7
2848,Rosa Rugani de la Universidad de Padua,1,1,7
2285,Maurizio Gjivovich © Fondazione Guelpa“La ciudad,1,1,6


Let's explore if these names are duplicated. Later I tackle this issue.

In [42]:
people_stem[people_stem['full_name'].str.contains('Vicente Arribas')]

Unnamed: 0_level_0,full_name,gender
article_id,Unnamed: 1_level_1,Unnamed: 2_level_1
b074aa509f684da596a280f188a00919,Vicente Arribas de Paz Jefe Servicio Urgencias...,male


In [43]:
print_article('b074aa509f684da596a280f188a00919')

TITLE: Qué remedios naturales que funcionan realmente contra la tos

La tos es una de las consecuencias menos deseadas de los resfriados y gripes estacionales que acompañan al invierno. Irritante e irritable es un fenómeno que puede acompañarnos días e incluso semanas y acabar desesperando a quienes la sufren y a su entorno.

“La tos se puede definir como un mecanismo que posee el organismo para expulsar algún material que se encuentra en vía aérea, como la mucosidad u otro cuerpo extraño. En algunos casos pude deberse a la irritación de la mucosa faríngea, laríngea traqueal o bronquial”, indica Dr. Vicente Arribas de Paz, Jefe Servicio Urgencias del Hospital Universitario Sanitas La Zarzuela.

La tos se puede definir como un mecanismo que posee el organismo para expulsar algún material que se encuentra en vía aérea, como la mucosidad u otro cuerpo extraño” Vicente Arribas de Paz Jefe Servicio Urgencias del Hospital Universitario Sanitas La Zarzuela

Este reflejo natural que despeja nu

#### Explore if there are too short names

Character threshold = 6 (see below)

In [44]:
people_unique['chars'] = people_unique['full_name'].apply(lambda x: len(x))
people_unique.loc[people_unique['chars']<6].sort_values(by='chars')

Unnamed: 0,full_name,article_id,gender,words,chars
4,A R,1,1,2,3
3274,po r,1,1,2,4
2678,R No,1,1,2,4
2509,P - ¿,1,1,3,5
2512,P Leí,1,1,2,5
624,D Lee,1,1,2,5
1794,Jun 5,1,1,2,5


In [45]:
six_chars = people_unique.loc[people_unique['chars']<6]['full_name'].to_list()
people_stem = people_stem.loc[~people_stem['full_name'].isin(six_chars)]
people_unique = people_stem.reset_index().groupby('full_name').count().reset_index()
people_unique['words'] = people_unique['full_name'].apply(lambda x: len(x.split()))
people_unique['chars'] = people_unique['full_name'].apply(lambda x: len(x))
people_unique.sort_values(by='chars')

Unnamed: 0,full_name,article_id,gender,words,chars
1894,Le Pen,1,1,2,6
3226,Yi Cui,1,1,2,6
3220,Xu Bin,1,1,2,6
3196,Wu Lei,1,1,2,6
2861,S XVII,1,1,2,6
...,...,...,...,...,...
1686,José María Martínez SEGERSuspenso sistemáticoE...,1,1,6,58
783,Eduardo Tizzano Jefe de Genética del Hospital ...,1,1,9,65
713,Decano Facultad Ciencias Jurídicas y Economica...,1,1,11,70
3139,Vicente Arribas de Paz Jefe Servicio Urgencias...,1,1,13,98


In [46]:
people_unique.shape

(3309, 5)

#### Duplicates

Some names are duplicated, see below:

In [47]:
people_unique.loc[people_unique['full_name'].str.contains('Marta Hervera')]

Unnamed: 0,full_name,article_id,gender,words,chars
2183,Marta Hervera,1,1,2,13
2184,Marta HerveraExperta europea,1,1,3,28


In [48]:
people_stem.loc[people_stem.index=='75359b0a5c83c27e66d003dd42d7ec79']

Unnamed: 0_level_0,full_name,gender
article_id,Unnamed: 1_level_1,Unnamed: 2_level_1
75359b0a5c83c27e66d003dd42d7ec79,Marta Hervera,female
75359b0a5c83c27e66d003dd42d7ec79,Marta HerveraExperta europea,female
75359b0a5c83c27e66d003dd42d7ec79,Roberto Élices,male


In [49]:
print_article('75359b0a5c83c27e66d003dd42d7ec79')

TITLE: ¿Perros y gatos vegetarianos? La moda vegana llega a la dieta de las mascotas

La voluntad de que las mascotas sean uno más de casa e imiten la conducta de los humanos con los que conviven está afectando su alimentación. En una sociedad en la que cada vez más personas deciden alimentarse sin comer carne –con una dieta vegetariana–, o seguir una alimentación vegana –rechazando los alimentos de origen animal–, muchos perros y gatos están asumiendo las dietas de sus propietarios.

Los veterinarios y expertos en nutrición de perros y gatos constatan que se trata de una práctica que tiene cada día más seguidores, en una plasmación directa en el mundo de las mascotas de lo que sucede socialmente, con el veganismo como tendencia en auge.

“Recibimos peticiones para perros que quieren una dieta vegetariana; en Estados Unidos ya hace tiempo que se están dando este tipo de dietas y se han analizado dietas comerciales vegetarianas que existen en el mercado para ver si realmente tenían todo

From more than 3000 names collected, there are only 156 names that correspond to partially duplicated names (contained in other names):

In [50]:
duplicated = pd.DataFrame()
for idx_person,person in people_unique.iterrows():
    rest = people_unique.loc[people_unique['full_name'] != person['full_name']]
    duplicated = pd.concat([duplicated, rest.loc[rest['full_name'].str.contains(person['full_name'])]])

In [51]:
duplicated.sort_values(by='full_name')

Unnamed: 0,full_name,article_id,gender,words,chars
46,Agustín Sánchez Lavega,1,1,3,22
47,Agustín Sánchez Lavega Primer,1,1,4,29
47,Agustín Sánchez Lavega Primer,1,1,4,29
133,Alfonso Enseñat de Villalonga,1,1,4,29
189,Anatxu ZabalbeascoaSão Paulo,1,1,3,28
...,...,...,...,...,...
2987,Sonia de AssisComo,1,1,3,18
3050,Taqui Ongoy’ En el mundo andino,1,1,6,31
3159,Vladímir A Kozlov Profesor-investigador de la,1,1,6,45
3206,Xavier Triadó Investigador,1,1,3,26


In [52]:
duplicated_names = duplicated['full_name'].tolist()
people_stem.loc[people_stem['full_name'].isin(duplicated_names)].sort_values(by='full_name')

Unnamed: 0_level_0,full_name,gender
article_id,Unnamed: 1_level_1,Unnamed: 2_level_1
472e4d0abb0dc3569596a4b4e33f487c,Agustín Sánchez Lavega,male
472e4d0abb0dc3569596a4b4e33f487c,Agustín Sánchez Lavega Primer,male
3fba8fc7a38d1d0417bd017486abd206,Alfonso Enseñat de Villalonga,male
e83b9afc8617146665bed0729f75abb0,Anatxu ZabalbeascoaSão Paulo,unknown
adb62fb792cd03b722549afb4c5c836c,Antonio CórdobaEl primer,male
...,...,...
a6a760e33ef85e8618d93d39b5d98ab3,Sonia de AssisComo,female
28ce887b940ac86bf7c17fa7312fc9fc,Taqui Ongoy’ En el mundo andino,unknown
a7b22ecbfd049685592e16b6322d8042,Vladímir A Kozlov Profesor-investigador de la,male
65d98bbf3e38c714639ddcd33634b93d,Xavier Triadó Investigador,male


Let's replace those full_names by the shortest name, but only for those who have an identified gender (otherwise, it may be the family name, in such a case we do not want to loose the first name, and unidentified genders will not be computed when calculating the number of mentions)

In [53]:
for idx_person,person in people_unique.iterrows():
    count_errors = 0
    # check if the person has gender assigned, otherwise pass
    try:
        if 'Miguel Ángel' not in person['full_name'] and 'María Jesús' not in person['full_name']: # to avoid having Miguel Ángel as the most common name
            if people_stem.loc[people_stem['full_name'] == person['full_name']]['gender'].isin(['male','female']).values[0]:
                # obtain a df with the other people
                rest = people_unique.loc[people_unique['full_name'] != person['full_name']]
                # get a df with duplicated names: names containing the name of the row in people_unique
                duplicated = rest.loc[rest['full_name'].str.contains(person['full_name'])]
                # for all duplicated found, replace the full_name by the name in row in people_unique (it is the shortest)
                for idx_duplicated,item_duplicated in duplicated.iterrows():
                    article_id = people_stem.loc[people_stem['full_name']==item_duplicated['full_name']].index.values[0]
                    people_stem.loc[(people_stem.index==article_id) & (people_stem.full_name==item_duplicated['full_name']),'full_name'] = person['full_name']
    except:
        count_errors = count_errors + 1
        pass
print("Errors in " + str(count_errors) + " STEM people")

Errors in 0 STEM people


If the replacement went well, the article below should not contain as STEM people 'Marta HerveraExperta europea' but 'Marta Hervera' (and 'Roberto Élices')

In [54]:
print_article('75359b0a5c83c27e66d003dd42d7ec79')

TITLE: ¿Perros y gatos vegetarianos? La moda vegana llega a la dieta de las mascotas

La voluntad de que las mascotas sean uno más de casa e imiten la conducta de los humanos con los que conviven está afectando su alimentación. En una sociedad en la que cada vez más personas deciden alimentarse sin comer carne –con una dieta vegetariana–, o seguir una alimentación vegana –rechazando los alimentos de origen animal–, muchos perros y gatos están asumiendo las dietas de sus propietarios.

Los veterinarios y expertos en nutrición de perros y gatos constatan que se trata de una práctica que tiene cada día más seguidores, en una plasmación directa en el mundo de las mascotas de lo que sucede socialmente, con el veganismo como tendencia en auge.

“Recibimos peticiones para perros que quieren una dieta vegetariana; en Estados Unidos ya hace tiempo que se están dando este tipo de dietas y se han analizado dietas comerciales vegetarianas que existen en el mercado para ver si realmente tenían todo

In [55]:
people_stem.loc[people_stem['full_name'].str.contains('Agustín Sánchez')]

Unnamed: 0_level_0,full_name,gender
article_id,Unnamed: 1_level_1,Unnamed: 2_level_1
472e4d0abb0dc3569596a4b4e33f487c,Agustín Sánchez Lavega,male
472e4d0abb0dc3569596a4b4e33f487c,Agustín Sánchez Lavega,male


In [56]:
people_stem.loc[people_stem['full_name'].str.contains('Marta Hervera')]

Unnamed: 0_level_0,full_name,gender
article_id,Unnamed: 1_level_1,Unnamed: 2_level_1
75359b0a5c83c27e66d003dd42d7ec79,Marta Hervera,female
75359b0a5c83c27e66d003dd42d7ec79,Marta Hervera,female


In [57]:
people_stem['gender'].value_counts()

male       2267
female      746
unknown     488
andy         61
Name: gender, dtype: int64

Now you can drop those STEM people that are duplicated in the same article, as the example above

In [58]:
people_stem.reset_index()[['article_id','full_name']].duplicated().sum()

41

In [59]:
duplicated = people_stem.reset_index()[['article_id','full_name']].duplicated()
people_stem.reset_index().loc[duplicated].sort_values(by='full_name')

Unnamed: 0,article_id,full_name,gender
505,472e4d0abb0dc3569596a4b4e33f487c,Agustín Sánchez Lavega,male
57,3fba8fc7a38d1d0417bd017486abd206,Alfonso Enseñat,male
2415,975265026f4eada2b8496a778293a893,Ana Laguna Pradas,male
639,64d0df447f5ccce2d6cc252001694be3,Anaya Hernández,male
2567,adb62fb792cd03b722549afb4c5c836c,Antonio Córdoba,male
1389,61ab08b310b8eb957f30b69ceb798c87,Barbara Oakley,female
224,542e0c2f310a8e0e9b9b555b20e9d9c0,Beatriz García Paesa,female
1954,73a852fbee9f5db53f5092a688eb9124,Blanca Lleó,female
2035,e62a1bc6b76c18f45dc0de06881dace8,Camilo José Cela,male
2569,adb62fb792cd03b722549afb4c5c836c,Carlos Spottorno,male


In [60]:
people_stem.reset_index(inplace=True)
people_stem.drop_duplicates(subset=['article_id','full_name'], inplace=True)
people_stem[['article_id','full_name']].duplicated().sum()

0

In [61]:
people_stem.set_index('article_id', inplace=True)

In [62]:
people_stem.reset_index().loc[people_stem.reset_index().duplicated(subset=['article_id', 'full_name'])].sort_values(by='full_name')

Unnamed: 0,article_id,full_name,gender


In [63]:
print_article('32903b469fefc6e0ec51f028c34e95bb')

TITLE: Un estudio sobre las 14 dietas más populares concluye que los beneficios no duran más de un año | Ciencia | EL PAÍS

Un estudio sobre las 14 dietas más populares concluye que los beneficios no duran más de un año Las restricciones reducen el riesgo de enfermedades cardiovasculares y funcionan para perder peso, pero es difícil mantenerlas

Varias verduras y hortalizas en un mercado.

La dieta Atkins, que apuesta por muchas proteínas y poco carbohidratos, la paleodieta, que se centra en los alimentos que estaban disponibles antes de la revolución neolítica, la dieta DASH para controlar la hipertensión y el ayuno intermitente son tan solo algunos ejemplos de las propuestas que recorren la web y prometen un resultado en pocos meses. Un estudio reciente publicado en la revista BMJ (British Medical Journal) concluye que todas estas restricciones alimentarias tienen el mismo efecto y que estos beneficios no duran más de un año.

Los resultados, obtenidos a partir de 22.000 pacientes co

In [64]:
people_unique = people_stem.reset_index().groupby('full_name').count().reset_index().sort_values(by='article_id', ascending=False)
people_unique.shape

(3215, 3)

In [65]:
people_stem['full_name'].nunique()

3215

We have now reduced from 3360 to 3214 unique people

### Remove historical people, politicians and sports people

Remove historical people, kings and politicans

In [66]:
def is_king(name):
    ''' Detects if the name has a roman numeral after a non-roman numeral and returns True/False
    '''
    is_king = False
    # set pattern for regex: from I to XXX including V
    pattern = re.compile(r"""   
                            ^(X{0,3})?
                            (IX|IV|V?I{0,3})?$
        """, re.VERBOSE)
    for idx,word in enumerate(name.split()):
        if idx < len(name.split()) -1 :
            if not re.match(pattern, name.split()[idx]):
            #is_roman_number(name.split()[idx]) == False:
            #if is_roman_number(name.split()[idx]) == False:
                if re.match(pattern, name.split()[idx + 1]):
                #if is_roman_number(name.split()[idx + 1]) == True:
                    is_king = True
    return is_king

In [67]:
people_unique['is_king'] = people_unique['full_name'].apply(lambda x: is_king(x))

In [68]:
people_unique.loc[people_unique['is_king']==True]

Unnamed: 0,full_name,article_id,gender,is_king
463,Carlos III,7,7,True
1663,Juan Carlos I,3,3,True
135,Alfonso X el Sabio,2,2,True
2295,Mohamed VI,1,1,True
1975,Malcolm X,1,1,True
2772,S XVII,1,1,True
2616,Ramsés II,1,1,True
2617,Ramsés V,1,1,True
462,Carlos II el Hechizado,1,1,True
464,Carlos III de Madrid,1,1,True


In [69]:
kings = people_unique.loc[people_unique['is_king']==True]['full_name'].to_list()
people_stem = people_stem.loc[~people_stem['full_name'].isin(kings)]
people_unique = people_stem.reset_index().groupby('full_name').count().reset_index()
people_unique['words'] = people_unique['full_name'].apply(lambda x: len(x.split()))
people_unique['chars'] = people_unique['full_name'].apply(lambda x: len(x))

In [70]:
people_unique[people_unique['full_name'].str.contains('Carlos')]

Unnamed: 0,full_name,article_id,gender,words,chars
445,Carlos Andradas,2,2,2,15
446,Carlos Barrera,1,1,2,14
447,Carlos Blanco,1,1,2,13
448,Carlos Closa,1,1,2,12
449,Carlos Collado,1,1,2,14
450,Carlos Díaz del Río,1,1,4,19
451,Carlos Fernando Jung,1,1,3,20
452,Carlos Fernández,2,2,2,16
453,Carlos Ferrater,1,1,2,15
454,Carlos Fraile,1,1,2,13


Let's explore the names appearing most often. We see some of them are politicians, which we can remove from the dataset (below).

In [71]:
people_unique.iloc[:50]

Unnamed: 0,full_name,article_id,gender,words,chars
0,11 35La,1,1,2,7
1,14 57Todo,1,1,2,9
2,A Estrada,1,1,2,9
3,A Pérez HerreraComo profesora,1,1,4,29
4,A afarensis,1,1,2,11
5,ANTONIO DIÉGUEZTras,1,1,2,19
6,ANTONIO LUCASSe,1,1,2,15
7,Abdoulaye Djimdé,1,1,2,16
8,Abel Caballero,1,1,2,14
9,Abenza Rojo,1,1,2,11


In [72]:
people_unique[people_unique['full_name'].str.contains('Macron')]

Unnamed: 0,full_name,article_id,gender,words,chars
814,Emmanuel Macron,1,1,2,15


'Jose María Martínez' appears 4 times mentioned, and despite it is a famous football player, in this dataset it refers to STEM names (can be checked by looking at the body text of the articles ids below)

In [73]:
people_stem.loc[people_stem['full_name']=='José María Martínez'].index

Index(['6870f1320484210b7a3c134d8bdebb92', 'b855158d720063fd5ac3d296a47ecfe5',
       'bb760b3a10e724afa07e5cf2c84234d4', '5981447d8b090734de8058d4ecf54763'],
      dtype='object', name='article_id')

In [74]:
print_article('5981447d8b090734de8058d4ecf54763')

TITLE: «Esta facultad es superendogámica; todos son padres, hermanos y hasta amantes»

Se sienta en la biblioteca de la Facultad de Odontología de la Universidad Complutense y es abordada por ABC. Al principio, se niega a dar declaraciones, cuando finalmente se decide, le empieza a temblar la voz. No quiere hablar pero algo la obliga: denunciar las injusticias. No quiere que la escuchen, tampoco quiere dar su nombre y le propone a ABC, «María». « Esa facultad es super endogámica, son todas familias , aquí hay padres, hermanos y hasta amantes...Siempre ha sido así ¿Has ido a la universidad? Entonces sabes cómo funciona la endogamia en la universidad española, tanto en la pública como en la privada y en todas partes...». « La odontología es un negocio , no hay un MIR como los médicos, para formarte hay que hacer másteres que son carísimos, la gente se mata por entrar, vale todo...». Su mensaje está cargado de furia y también de ironía.

Ella habla de la endogamia de una forma diametralme

In [75]:
to_remove_keywords = ['Rey Juan Carlos', 'Isabel Díaz Ayuso', 'Felipe González', 'Mariano Rajoy', 'Pedro Sánchez', \
                 'José María Aznar', 'Donald Trump', 'Barack Obama', 'Vladímir Putin', 'Angela Merkel', 'Emmanuel Macron', \
                'Adolf Hitler', 'EL MUNDO', 'Pablo Iglesias', 'Juan Carlos', 'Quim Torra', 'Hillary Clinton', \
                      'Hacienda Pública', 'Sherlock Holmes', 'Gustavo Adolfo Bécquer', 'Alfonso Alonso', 'Rita Barberá', \
                     'Big Vang', 'Francisco Camps', 'Cristina Cifuentes', 'Antropología Física', 'Cristiano Ronaldo', \
                     'Papa Francisco', 'Diana Quer', 'François Mitterrand', 'Michael Jackson', 'Cristóbal Colón', \
                     'Nicolás Maduro', 'Camilo José Cela', 'Albert Rivera', 'Salvador Illa', 'Alberto Núñez Feijóo', \
                     'Esperanza Aguirre', 'Matemática Discreta', 'Matemática Aplicada', 'Rovira i Virgili', \
                     'Xi Jinping', 'Kim Jong-un', 'Kim Jong Un', 'Carles Puigdemont', 'Carme Forcadell', 'Mario Draghi', \
                      'Theresa May', 'Florentino Pérez', 'Kim Il-sung', 'Lewis Hamilton']
to_remove = people_unique[people_unique['full_name'].isin(to_remove_keywords)]['full_name'].to_list()
people_stem = people_stem.loc[~people_stem['full_name'].isin(to_remove)]
people_unique = people_stem.reset_index().groupby('full_name').count().reset_index()

In [76]:
people_unique.shape

(3150, 3)

We have now a total of 3151 names collected. The names that appear most often are indeed related with STEM (see below):

In [77]:
people_unique.sort_values(by='article_id', ascending=False)[0:50]

Unnamed: 0,full_name,article_id,gender
1026,Gay de Liébana,12,12
2189,Mies van der Rohe,5,5
2050,Mark Zuckerberg,5,5
930,Florentino Felgueroso,4,4
1597,José María Martínez,4,4
1596,José María Madiedo,4,4
1325,Javier López,3,3
2747,Santiago Calatrava,3,3
2325,Norman Foster,3,3
1587,José Manuel Moreno,3,3


In [78]:
print_article('06ccd66f526a353bc68765e41e6595f8')

TITLE: La polémica en el Teatro Real pone la cultura en el punto de mira

¿Puede haber mayor sacrilegio que vociferar en un teatro lírico mientras en el escenario los cantantes tratan de entonan una ópera de Verdi? Eso que no cabe en la imaginación de ningún amante de la música –y ciertamente no del maestro Nicola Luisotti que se encontraba en el foso– sucedió el domingo en la tercera función de Un ballo in maschera , el título con el que el Teatro Real ha abierto la temporada y con el que se ha propuesto dar un paso más hacia la tan deseada normalidad.

Desearla no significa, sin embargo, que podamos gozar de ella. Y por mucho que la normativa sanitaria no sea ya la misma que regía el pasado mes de julio al estrenar el coliseo madrileño La Traviata , a día de hoy la gente sigue asumiendo que mantener la distancia de seguridad es crucial, también en teatros y salas de cine y música. Si el verano había comenzado con la norma del metro y medio de separación, siendo ahí voluntaria la masc

## 6. Compute gender mentions in each article <a class="anchor" id="gender-mentions"></a>

In [79]:
def compute_stem_gender_mentions(articles_dataframe, people_dataframe):
    ''' Calculate number of different women and men that are mentioned in each article
    '''
    
    articles_dataframe['mentions_STEM_women'] = articles_dataframe.apply(lambda x: \
                        len(people_dataframe.loc[(people_dataframe.index==x.name) \
                        & (people_dataframe.gender == 'female')]), axis=1)

    articles_dataframe['mentions_STEM_men'] = articles_dataframe.apply(lambda x: \
                        len(people_dataframe.loc[(people_dataframe.index==x.name) \
                        & (people_dataframe.gender == 'male')]), axis=1)
    return articles_dataframe

In [80]:
articles = compute_stem_gender_mentions(articles, people_stem)

## 7. Evaluation <a class="anchor" id="evaluation"></a>

### Evaluate gender assignment

Let's take 150 names from all the categories found by Gender-Guesser, 50 male and 50 females, 25 unknown and 25 andy

In [81]:
men = people_stem.loc[people_stem['gender']=='male'].sample(50)
women = people_stem.loc[people_stem['gender']=='female'].sample(50)
unknown = people_stem.loc[people_stem['gender']=='unknown'].sample(25)
andy = people_stem.loc[people_stem['gender']=='andy'].sample(25)
gender_eval = pd.concat([men, women, unknown,andy])
gender_eval['y_pred'] = gender_eval['gender'].apply(lambda x: 1 if x=='male' else 2 if x=='female' else 3 if x=='unknown' else 4)

I will save this sampled dataset into an excel where I will check all names and add a column 'y_true' to later calculate accuracy, precision and recall

In [82]:
gender_eval.reset_index()[['full_name','gender','y_pred']].to_excel('gender_classification_evaluation.xlsx')

In [83]:
gender_eval_pred_true = pd.read_excel('gender_classification_evaluation_filled.xlsx')
gender_eval_pred_true

Unnamed: 0.1,Unnamed: 0,full_name,gender,y_pred,y_true
0,0,Jan Koum,male,1,1
1,1,Ernest Jones,male,1,1
2,2,Xavier Marcet,male,1,1
3,3,Daniel García,male,1,1
4,4,Miguel Ángel Miranda,male,1,1
...,...,...,...,...,...
145,145,Wu Lei,andy,4,4
146,146,Pin ochet,andy,4,3
147,147,Hua Hua,andy,4,4
148,148,dalái lama,andy,4,4


In [84]:
target_names = ['men', 'women', 'unknown', 'andy']
print(classification_report(gender_eval_pred_true['y_true'], gender_eval_pred_true['y_pred'], target_names=target_names))

              precision    recall  f1-score   support

         men       1.00      0.85      0.92        59
       women       0.92      0.92      0.92        50
     unknown       0.56      0.61      0.58        23
        andy       0.72      1.00      0.84        18

    accuracy                           0.85       150
   macro avg       0.80      0.84      0.81       150
weighted avg       0.87      0.85      0.86       150



### Evaluate article classification as STEM or non-STEM related

From the total of articles in the dataset, I will manually review a subset of 50 articles that contain a detected quote from STEM people and 50 articles that do not contain any detected quote. This way I will calculate precision, recall and accuracy as in the case of gender assignment of collected names just shown above.

In [85]:
articles_science = articles.loc[(articles['mentions_STEM_women'] > 0) | (articles['mentions_STEM_men'] > 0)]
articles_science['category'] = 'STEM'
articles_science['y_pred'] = 1
articles_science.shape

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  articles_science['category'] = 'STEM'
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  articles_science['y_pred'] = 1


(1845, 11)

In [86]:
articles_no_science = articles.drop(articles_science.index.values)
articles_no_science['category'] = 'NO-STEM'
articles_no_science['y_pred'] = 0
articles_no_science.shape

(24155, 11)

In [87]:
evaluation = pd.concat([articles_science.sample(50, random_state=42),articles_no_science.sample(50, random_state=42)])
evaluation.shape

(100, 11)

As done earlier to evaluate the accuracy of automatic gender assignment of collected STEM names, I will now export the evaluation dataframe to an excel file to manually review and annotate the true category of the articles.

The review is done as follows:
- For articles in the STEM category: collected sentences are checked to detect if a STEM person is being mentioned in that sentence. Even if the article is about fashion, sports or politics, if in the sentence the collected STEM person is quoted because of his/her STEM activity as expert or academic authority, then it is preserved as STEM category.
- For articles in the NO-STEM category: titles are checked and assigned to STEM or NO-STEM if the article title is prominently about science or a STEM field.

In [88]:
def save_article_classification_evaluation(articles, people_stem, sentences, links, n=0):
    ''' Saves excel file with article title, STEM people names, quoted sentences and link
    
    INPUT:
        articles (dataframe): contains articles and stem mentions
        sentences (list): quotes by STEM people
        people_stem (list): names of STEM people collected
        n = number of rows to show
    '''
    eval_df = pd.DataFrame()
    count = 1
    if n!=0:
        articles = articles[:n]
    for article_id, row in articles.iterrows():
        links_article = links.loc[links.index==article_id]['link'].values
        for item in links_article:
            if not '#' in item:
                link_good = item
        new_row = [count, row['title'], row['frontpage_date'], link_good, \
                   people_stem[people_stem.index==article_id].values.tolist(),
                  sentences[sentences.index==article_id].values.tolist(), row['category'], row['y_pred']]
        if count == 1: # in the first row, create dataframe
            eval_df = pd.concat([eval_df, pd.DataFrame(data=[new_row], \
                                columns=['#','title', 'date', 'link', 'people_stem', 'sentences', 'category', 'y_pred'])])
        else: # after the first row, append row to dataframe
            eval_df.loc[eval_df.shape[0]] = new_row
        count = count+1
    # save excel
    eval_df.to_excel('article_classification_evaluation.xlsx')

In [89]:
save_article_classification_evaluation(evaluation, people_stem, sentences_stem, links)

In [90]:
eval_assessment = pd.read_excel('article_classification_evaluation_filled.xlsx')

In [91]:
eval_assessment.head()

Unnamed: 0.1,Unnamed: 0,#,title,date,link,people_stem,sentences,category,y_pred,y_true
0,0,1,Por qué este año habrá más concursos de habili...,2020-01-10,https://elpais.com/cultura/2020/01/09/televisi...,"[['Ana González Neira', 'female']]","[['Ana González Neira, profesora de la Faculta...",STEM,1,1
1,1,2,La cobertura del móvil de Diana Quer desmonta ...,2019-11-20,https://www.elmundo.es/espana/2019/11/20/5dd56...,"[['El Chicle', 'female']]",[['La tesis de los investigadores y las acusac...,STEM,1,0
2,2,3,"Emilio Lledó: ""Sin las humanidades, nada es po...",2018-03-28,http://www.elmundo.es/cultura/literatura/2018/...,"[['Emilio Lledó', 'male']]",[['Es sólo el resultado de mi experiencia como...,STEM,1,0
3,3,4,Un enorme meteorito cae sobre Brasil - ELMUNDOTV,2020-04-25,http://videos.elmundo.es/v/wf55f9nxzXk-un-enor...,"[['Carlos Fernando Jung', 'male']]","[['El profesor Carlos Fernando Jung, del Obser...",STEM,1,1
4,4,5,Suecia: Leer y morir | Opinión | EL PAÍS,2019-01-03,https://elpais.com/elpais/2019/01/03/opinion/1...,"[['Gunnar Asplund', 'male']]",[['Suscríbete aquí\n\nLa construcción de la Bi...,STEM,1,1


In [92]:
target_names = ['no STEM', 'STEM']
print(classification_report(eval_assessment['y_true'], eval_assessment['y_pred'], target_names=target_names))

              precision    recall  f1-score   support

     no STEM       0.92      0.81      0.86        57
        STEM       0.78      0.91      0.84        43

    accuracy                           0.85       100
   macro avg       0.85      0.86      0.85       100
weighted avg       0.86      0.85      0.85       100



Some examples of true positives:

>Cien años es un número redondo, capta nuestra atención, pero no es el dato más importante, ni mucho menos», señala **Miguel Ángel Vázquez**, médico geriatra, investigador de la longevidad y presidente de la Sociedad Gallega de Gerontología y Geriatría
>
Source: ["El secreto de la longevidad: los 'inmortales' de Ourense"](http://elpais.com/economia/2017/11/15/actualidad/1510772537_869262.html) (03/02/2020)

>Los investigadores también realizaron varios modelos para comprender la influencia de la actividad volcánica. Pero existe otra, defendida por algunos investigadores, que dice que el meteorito no fue el único culpable de la gran extinción del Cretácico-Paleógeno (K-Pg), sino que unas erupciones volcánicas masivas en la India, en la región conocida como las escaleras del Decán, contribuyeron al exterminio.<br>
>**Laia Alegret**, en su laboratorio U. Zaragoza.
>
Source: [Nada más que el asteroide mató a los dinosaurios](http://abc.es/ciencia/abci-nada-mas-asteroide-mato-dinosaurios-202001162000_noticia.html) (16/01/2020) 

## 8. Export to csv for statistics <a class="anchor" id="export"></a>

In [96]:
#articles.to_csv('data/sample50k_articles_stats.csv', sep='‰')
articles[['site','frontpage_date','mentions_STEM_women','mentions_STEM_men']].to_csv('data/sample50k_articles_stats.csv', sep='‰')

In [94]:
people_stem.to_csv('data/sample50k_people_stats.csv', sep='‰')