# Tutorial 6 - Collecting data from Wikipedia

## 1. Wikipedia

https://github.com/goldsmith/Wikipedia

A Python library that wraps the [MediaWiki API](https://www.mediawiki.org/wiki/API:Main_page) to simplify access to Wikipedia data: article content, summary, links, images, title, etc.

In [None]:
#!pip install wikipedia

In [None]:
import wikipedia

- Run a query (returns the Wikipedia pages that match it):

In [None]:
wikipedia.search("Trump")

- View the content of a page:

In [None]:
page = wikipedia.page("Donald Trump")

In [None]:
page.title

In [None]:
page.url

In [None]:
page.content

In [None]:
page.links

In [None]:
wikipedia.set_lang("es")

In [None]:
page = wikipedia.page("Donald Trump")
page.content

In [None]:
wikipedia.summary("Donald Trump", sentences=1)

In [None]:
page.images

In [None]:
import requests
import IPython.display as Disp
url = page.images[6]
Disp.Image(requests.get(url).content, width = 400) #height = 50
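
The index `6` used above depends on the order in which the images happen to be listed for the page. A minimal sketch of a more robust selection, assuming `page.images` is a list of URL strings as in the cell above. The helper `first_raster_image` and the sample URLs are hypothetical and not part of the `wikipedia` library:

```python
def first_raster_image(urls, extensions=(".jpg", ".jpeg", ".png")):
    """Return the first URL ending with one of the extensions, or None."""
    for url in urls:
        # Compare in lower case so that '.JPG' and '.jpg' are treated alike
        if url.lower().endswith(extensions):
            return url
    return None

# Made-up placeholder URLs standing in for page.images
sample_images = [
    "https://example.org/Icon.svg",
    "https://example.org/Portrait.JPG",
    "https://example.org/Signature.png",
]
chosen = first_raster_image(sample_images)
print(chosen)
```

The returned URL can then be passed to `requests.get` and `Disp.Image` as before, after checking that it is not `None`.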

**Alternatives to the 'wikipedia' library:**
    - https://pypi.org/project/Wikipedia-API/
    - https://en.wikipedia.org/wiki/Help:Creating_a_bot#Python

## 2. Wikipedia Page views

https://github.com/Commonists/pageview-api

Statistics on the number of views of Wikipedia articles.

In [None]:
#!pip install git+https://github.com/Commonists/pageview-api.git

In [None]:
import pageviewapi

- How did the number of daily views of the Donald Trump page evolve between two dates?

In [None]:
result1=pageviewapi.per_article('es.wikipedia', 'Donald Trump', '20201101', '20201110',
                        access='all-access', agent='all-agents', granularity='daily')
result1

In [None]:
import pandas as pd

df1 = pd.DataFrame()

for item in result1.items():
    for article in item[1]:
        timestamp=article['timestamp'][:8] #first 8 digits
        a_row = pd.Series([timestamp, article['views']])
        row_df = pd.DataFrame([a_row])
        df1 = pd.concat([df1, row_df], ignore_index=True)
        
df1.columns =['timestamp', 'views'] 
df1
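
The row-by-row loop above can also be written as a small reusable function. This is a sketch that assumes the response has the shape the loop relies on: a dictionary whose values are lists of entries with `timestamp` and `views` keys. The name `views_rows` and the sample values are hypothetical:

```python
from datetime import datetime

def views_rows(result):
    """Flatten a per_article-style response into (date, views) tuples."""
    rows = []
    for entries in result.values():
        for entry in entries:
            # Keep the first 8 digits of the timestamp (YYYYMMDD) and parse them
            day = datetime.strptime(entry["timestamp"][:8], "%Y%m%d").date()
            rows.append((day, entry["views"]))
    return rows

# Made-up sample with the same shape as the response
sample = {"items": [
    {"timestamp": "2020110100", "views": 1200},
    {"timestamp": "2020110200", "views": 1500},
]}
rows = views_rows(sample)
print(rows)
```

The resulting list can be passed directly to `pd.DataFrame(rows, columns=['timestamp', 'views'])`, which avoids concatenating one row at a time.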

In [None]:
from matplotlib import pyplot

df1.plot(x='timestamp')
pyplot.xticks(rotation=80)
pyplot.show()

- How did the number of monthly views of the Donald Trump page evolve between two dates?

In [None]:
# Start and end dates are 8-digit strings in YYYYMMDD format
result2=pageviewapi.per_article('es.wikipedia', 'Donald Trump', '20180101', '20201030',
                        access='all-access', agent='all-agents', granularity='monthly')
result3=pageviewapi.per_article('es.wikipedia', 'Joe Biden', '20180101', '20201030',
                        access='all-access', agent='all-agents', granularity='monthly')
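
The start and end arguments are plain strings, so a malformed date is easy to pass by accident. A minimal sketch of a check, assuming the `YYYYMMDD` format used in the daily query earlier. The helper `is_valid_day` is hypothetical and not part of `pageviewapi`:

```python
from datetime import datetime

def is_valid_day(text):
    """Return True if text is an 8-digit YYYYMMDD string naming a real date."""
    # strptime alone accepts single-digit months and days,
    # so the length and digit checks come first
    if len(text) != 8 or not text.isdigit():
        return False
    try:
        datetime.strptime(text, "%Y%m%d")
    except ValueError:
        return False
    return True

print(is_valid_day("20201101"))
print(is_valid_day("2020111"))
```

Calling the helper on both dates before querying gives an early, readable failure instead of an error coming back from the API.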

In [None]:
result2

In [None]:
import pandas as pd

df2 = pd.DataFrame()

for item in result2.items():
    for article in item[1]:
        timestamp=article['timestamp'][:8] #first 8 digits
        a_row = pd.Series([timestamp, article['views']])
        row_df = pd.DataFrame([a_row])
        df2 = pd.concat([df2, row_df], ignore_index=True)
        
df2.columns =['timestamp', 'views'] 

df3 = pd.DataFrame()

for item in result3.items():
    for article in item[1]:
        timestamp=article['timestamp'][:8] #first 8 digits
        a_row = pd.Series([timestamp, article['views']])
        row_df = pd.DataFrame([a_row])
        df3 = pd.concat([df3, row_df], ignore_index=True)
        
df3.columns =['timestamp', 'views'] 

In [None]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.locator_params(nbins=10, axis='x')

# One line per article
for frame in [df2, df3]:
    ax.plot(frame['timestamp'], frame['views'])

ax.legend(['Trump','Biden'])
plt.xticks(rotation=80)

# Limit the number of x-axis ticks so that the labels stay readable
max_xticks = 15
xloc = plt.MaxNLocator(max_xticks)
ax.xaxis.set_major_locator(xloc)

plt.show()

- Which pages were the most viewed on November 4, 2020 on the English Wikipedia?

In [None]:
import pageviewapi
result=pageviewapi.top('en.wikipedia', 2020, 11, "04", access='all-access')
result

In [None]:
for items in result.items():
    print(items[1])

In [None]:
# `items` still holds the last (key, value) pair from the loop in the previous cell
for article in items[1][0]['articles']:
    print(article)

- What are people looking at on the Spanish Wikipedia today?

In [None]:
import pageviewapi
result=pageviewapi.top('es.wikipedia', 2020, 11, "10", access='all-access')
result

In [None]:
for items in result.items():
    for article in items[1][0]['articles']:
        print(article)

In [None]:
df_top = pd.DataFrame()

for items in result.items():
    for article in items[1][0]['articles']:
        a_row = pd.Series([article['article'], article['views']])
        row_df = pd.DataFrame([a_row])
        df_top = pd.concat([df_top, row_df], ignore_index=True)

df_top.columns =['article', 'views'] 
df_top[2:12]

In [None]:
df_top[2:22].plot.bar(x='article', y='views',rot=90)
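
The slices `[2:12]` and `[2:22]` skip the first two rows by position, which only works when the pages to ignore happen to be ranked first. A sketch of an alternative that filters by title, assuming the response shape used in the loops above. The helper `top_articles`, the sample titles and the view counts are all made up for illustration:

```python
def top_articles(result, n=10, excluded=()):
    """Return the n most viewed (article, views) pairs, skipping excluded titles."""
    pairs = []
    for entries in result.values():
        for article in entries[0]["articles"]:
            if article["article"] not in excluded:
                pairs.append((article["article"], article["views"]))
    # Sort by number of views, most viewed first
    pairs.sort(key=lambda pair: pair[1], reverse=True)
    return pairs[:n]

# Made-up sample with the same shape as the response of pageviewapi.top
sample = {"items": [{"articles": [
    {"article": "Main_Page", "views": 900},
    {"article": "Special:Search", "views": 800},
    {"article": "Example_A", "views": 300},
    {"article": "Example_B", "views": 500},
]}]}
top = top_articles(sample, n=2, excluded={"Main_Page", "Special:Search"})
print(top)
```

The titles to exclude differ between language editions, so they are passed in by the caller rather than hard-coded.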

## 3. To experiment...

- Suppose that Wikipedia approximates the **public notoriety** of certain people: *How well known is a person among citizens?*

- Suppose that news media, when they mention certain people, give them **media visibility**: *How visible is a person in the news media?*

Write a script that compares the **public notoriety** and the **media visibility** of a few people.

In [None]:
import pandas as pd

DATASET_CSV="../datasets/CNNCHILE_RAW.csv"

# on_bad_lines='skip' replaces the deprecated error_bad_lines=False (pandas >= 1.3)
df_CNN = pd.read_csv(DATASET_CSV,sep=',',on_bad_lines='skip')
df_CNN = df_CNN.drop(['Unnamed: 0'], axis = 1) # Drop the ID column
df_CNN['date'] = pd.to_datetime(df_CNN['date']) # Convert the date column to datetime

df_CNN

In [None]:
from pandasql import sqldf

q="""SELECT * FROM df_CNN WHERE date LIKE "2020-%";"""
CNN_2020=sqldf(q)
CNN_2020

In [None]:
import spacy

nlp = spacy.load('es_core_news_md')

In [None]:
entities={}

for index,row in CNN_2020.iterrows():
    if(index%100 == 0):
        print(index)  # progress indicator
    # Text of the news article (fifth column of the dataset)
    text=row.iloc[4]
    
    # Apply the NLP pipeline here, in particular tokenization and named entity recognition
    try:
        doc = nlp(text)
    except Exception:
        # Skip rows whose text cannot be processed
        continue
    
    # Go through the entities of the document and keep only those of type Person (PER)
    for ent in doc.ents:
        
        # Check whether the spaCy model labelled the entity as a person
        if(ent.label_=="PER"):
            # Split the entity into tokens
            tokenized_entity=(ent.text).split(" ") 
            
            # Keep only the entities that have between 2 and 4 tokens (usual form of names in Chile)
            if ((len(tokenized_entity)>1) and len(tokenized_entity)<=4):
                
                entity_full_name = ent.text
                
                if entity_full_name in entities:
                    entities[entity_full_name] += 1

                else:
                    entities[entity_full_name] = 1
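
The counting step above can be expressed more compactly with `collections.Counter`. This is a sketch on made-up `(text, label)` pairs standing in for `(ent.text, ent.label_)`, so that it runs without a spaCy model. The helper `count_full_names` is hypothetical:

```python
from collections import Counter

def count_full_names(mentions, min_tokens=2, max_tokens=4):
    """Count PER mentions whose name has between min_tokens and max_tokens words."""
    counts = Counter()
    for text, label in mentions:
        n_tokens = len(text.split(" "))
        if label == "PER" and min_tokens <= n_tokens <= max_tokens:
            counts[text] += 1
    return counts

# Made-up mentions; only the two-word PER names should be counted
sample_mentions = [
    ("Ana Pérez", "PER"),
    ("Ana Pérez", "PER"),
    ("Pérez", "PER"),
    ("Santiago de Chile", "LOC"),
    ("Juan Soto", "PER"),
]
counts = count_full_names(sample_mentions)
print(counts.most_common(2))
```

`Counter.most_common` returns the names already sorted by frequency, which is the ordering computed with `sorted` in the next cell.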

In [None]:
sortedVisibility = sorted(entities.items(), key=lambda x: x[1], reverse=True)
sortedVisibility

In [None]:
sortedVisibility[0]

In [None]:
len(sortedVisibility)

In [None]:
sortedVisibility[0:50]

In [None]:
popularity={}

for entity in sortedVisibility[:50]:
    name=entity[0]
    
    try:
        visits_per_month=pageviewapi.per_article('es.wikipedia', name, '20200101', '20201030', 
                                             access='all-access', agent='all-agents', granularity='monthly')

        ## Sum the monthly views (named total to avoid shadowing the built-in sum)
        total=0
        for item in visits_per_month.items():
            for article in item[1]:
                total+=article['views']
    
        ## Save
        popularity[name] = total
        
        print("Data found for: "+name+" - "+str(total))
        
    except Exception:
        # No page view data under this exact name: fall back to 1,
        # which keeps the logarithm taken later defined
        popularity[name] = 1
        print("No data for: "+name)
    
    

In [None]:
print(len(popularity))
popularity # which pages Spanish-speaking citizens look at on Wikipedia

In [None]:
sortedPopularity = sorted(popularity.items(), key=lambda x: x[0], reverse=False)
sortedPopularity

In [None]:
sortedVisibility = sorted(dict(sortedVisibility[0:50]).items(), key=lambda x: x[0], reverse=False)
print(len(sortedVisibility))
sortedVisibility

In [None]:
x = []
y = []
label = []

# Names left out of the plot
excluded = ['Donald Trump', 'Sebastián Piñera', 'Lionel Messi', 'Colo Colo', 'Bad Bunny', 'Barack Obama']

for person in sortedVisibility:
    name=person[0]
    if name not in excluded:
        x.append(person[1])  # number of mentions in the news
        label.append(name)
    
for person in sortedPopularity:
    name=person[0]
    if name not in excluded:
        y.append(person[1])  # total Wikipedia page views
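
The two loops above rely on both lists being sorted by name and containing exactly the same names, so that `x[i]` and `y[i]` refer to the same person. A sketch of an alternative that pairs the values by name explicitly. The helper `paired_values` and the sample figures are hypothetical:

```python
def paired_values(visibility, popularity, excluded=()):
    """Return (labels, x, y) aligned by name, keeping names present in both inputs."""
    labels, x, y = [], [], []
    for name in sorted(visibility):
        if name in popularity and name not in excluded:
            labels.append(name)
            x.append(visibility[name])
            y.append(popularity[name])
    return labels, x, y

# Made-up counts: name -> mentions in the news, name -> Wikipedia page views
sample_visibility = {"Ana Pérez": 40, "Juan Soto": 25, "Eva Rojas": 10}
sample_popularity = {"Juan Soto": 9000, "Ana Pérez": 3000}

labels, xs, ys = paired_values(sample_visibility, sample_popularity)
print(labels, xs, ys)
```

With the notebook's data, the inputs could be built as `dict(sortedVisibility)` and `dict(sortedPopularity)`, since both are lists of `(name, value)` pairs.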

In [None]:
from math import log

log_x=[log(value) for value in x]
log_y=[log(value) for value in y]

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure

plt.rcParams["figure.figsize"]=20,20

fig, ax = plt.subplots()

ax.scatter(x,y)

plt.xlabel("Media Visibility")
plt.ylabel("Wikipedia Popularity")

for i, txt in enumerate(label):
    ax.annotate(txt,(x[i], y[i]))#,fontsize=60) 
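
Besides the scatter plot, one possible way to summarize the comparison in a single number is a rank correlation between the two lists. The sketch below implements Spearman's coefficient with the standard library only. Choosing this measure is an assumption made for illustration, not something the tutorial prescribes, and the sample numbers are made up:

```python
from math import fsum, sqrt

def average_ranks(values):
    """Rank values from 1 upward, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]]
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mean_x, mean_y = fsum(rx) / len(rx), fsum(ry) / len(ry)
    cov = fsum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = fsum((a - mean_x) ** 2 for a in rx)
    var_y = fsum((b - mean_y) ** 2 for b in ry)
    # Undefined (division by zero) if all values in one list are equal
    return cov / sqrt(var_x * var_y)

# Made-up values: media mentions and page views for four people
print(spearman([40, 25, 10, 5], [3000, 9000, 500, 200]))
```

A value close to 1 would suggest that people who are mentioned more often in the news also tend to have more page views, while a value close to 0 would suggest no monotonic relation.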


## 4. Going a little further...

Using the summaries of Wikipedia pages, build a script that