# Top articles in Wikipedia 2019
Author: Gabriel Velástegui

The original idea comes from this [article](https://towardsdatascience.com/interactive-the-top-2019-wikipedia-pages-d3b96335b6ae) by Felipe Hoffa, published in January 2020. That article focuses on the top articles of the English Wikipedia in 2019. I decided to do a slightly different version by looking at the top ten articles for each month. I also compare the results across four languages: English, Spanish, Russian and German.

Several tools make it possible to analyze statistics for Wikipedia articles, such as [WikiShark](https://www.wikishark.com/), [Wikimedia stats](https://stats.wikimedia.org/) and [Pageviews Analysis](https://iw.toolforge.org/pageviews). The last one is especially interesting because of its wide range of options for visualizing article metrics. While exploring the most viewed articles in the Spanish Wikipedia, I found Hoffa's article and decided to explore the 2019 data for several languages.

I used Python for this project. This notebook focuses on data analysis and visualization. The data import process is explained in the repository on GitHub.

#### Dependencies:
- Repository: https://github.com/gra-vel/wiki-2019
- Python: pandas, matplotlib, seaborn, plotly

In [1]:
import wiki_analysis
import wiki_visual
import plotly.offline as py
# `py` is already plotly.offline, so init_notebook_mode can be called on it directly
py.init_notebook_mode(connected=True)

### Data

Initially, the raw data imported from Wikipedia’s API included entries that did not add much to the analysis. Entries such as “Wikipedia:” pages or “Main_Page” can give insight into how users interact with the site overall. However, the purpose of the project is to find which individual articles are the most viewed, so all these entries were removed from the dataset.
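The filtering step described above could look roughly like the sketch below. The prefix list and the helper names (`is_article`, `keep_articles`) are assumptions for illustration, and the sample rows are invented; the actual cleaning code lives in the repository and may differ.

```python
# Hypothetical list of non-article prefixes; only "Wikipedia:" and "Main_Page"
# are mentioned in the text, the rest are assumed for illustration.
NON_ARTICLE_PREFIXES = ("Wikipedia:", "Special:", "File:", "Portal:")
NON_ARTICLE_TITLES = {"Main_Page"}


def is_article(title):
    """Return True when a title looks like a regular encyclopedia article."""
    if title in NON_ARTICLE_TITLES:
        return False
    # str.startswith accepts a tuple and checks every prefix in it
    return not title.startswith(NON_ARTICLE_PREFIXES)


def keep_articles(rows):
    """Keep only the (title, views) pairs that refer to regular articles."""
    return [(title, views) for title, views in rows if is_article(title)]


# Invented sample rows, only to show the call
sample = [("Main_Page", 100), ("Wikipedia:About", 5), ("Avengers:_Endgame", 50)]
print(keep_articles(sample))
```

Note that a title such as `Avengers:_Endgame` contains a colon but is still kept, because only the listed prefixes are treated as non-article pages.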

The main issue with the data is that some of the articles’ views may come from spam, botnets or errors. Wikipedia’s API makes it possible to retrieve views by agent type (user, spider or automated). However, spam or botnets can pass as a user, so their views are counted as coming from human users, which inflates the totals. This is why the main task in this project was to find a way to tell whether views came from legitimate users or from automated programs, using only the information available from the API.
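As a rough illustration of how the agent type enters a request, the sketch below builds a URL for the per-article endpoint of the Wikimedia Pageviews REST API. The function name `per_article_url` is hypothetical, and the text does not say which endpoint the project actually uses, so this is only one possible way to query views by agent.

```python
from urllib.parse import quote

BASE_URL = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"
AGENTS = ("all-agents", "user", "spider", "automated")
ACCESS = ("all-access", "desktop", "mobile-web", "mobile-app")


def per_article_url(project, article, start, end,
                    access="all-access", agent="user"):
    """Build a daily per-article pageviews URL (hypothetical helper).

    start and end are dates written as YYYYMMDD strings.
    """
    if agent not in AGENTS:
        raise ValueError(f"unknown agent: {agent}")
    if access not in ACCESS:
        raise ValueError(f"unknown access method: {access}")
    # Titles use underscores instead of spaces and must be percent-encoded
    title = quote(article.replace(" ", "_"), safe="")
    return f"{BASE_URL}/{project}/{access}/{agent}/{title}/daily/{start}/{end}"


print(per_article_url("en.wikipedia.org", "Luke Perry", "20190101", "20191231"))
```

Requesting the same article with `agent="user"` and `agent="automated"` gives two series to compare, although, as noted above, traffic labeled as `user` can still include bots.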

One way to identify possible automated views was to check daily views by access method (desktop, mobile-web and mobile-app). Certain inconsistencies between access methods can point to automated views, as shown in the data analysis section.
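One simple way to express such an inconsistency is the share of a day's views that comes from a single access method. The sketch below flags days where one method accounts for almost all views. The 0.95 threshold, the function names and the sample numbers are assumptions for illustration, not the criteria used in the project.

```python
def access_shares(views):
    """Return each access method's share of the day's total views."""
    total = sum(views.values())
    if total == 0:
        return {access: 0.0 for access in views}
    return {access: count / total for access, count in views.items()}


def flag_suspicious_days(daily, threshold=0.95):
    """Return the days on which one access method exceeds the threshold share.

    daily maps a date string to a dict of views per access method.
    The threshold is an arbitrary value chosen for illustration.
    """
    flagged = []
    for day, views in sorted(daily.items()):
        shares = access_shares(views)
        if shares and max(shares.values()) > threshold:
            flagged.append(day)
    return flagged


# Invented numbers: the first day is almost entirely desktop traffic
daily = {
    "2019-03-01": {"desktop": 980000, "mobile-web": 15000, "mobile-app": 5000},
    "2019-03-02": {"desktop": 40000, "mobile-web": 55000, "mobile-app": 5000},
}
print(flag_suspicious_days(daily))
```

A flagged day is only a candidate for closer inspection, since a legitimate spike can also be concentrated in one access method.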

In [8]:
# importing final data
wiki_es = wiki_analysis.Wiki_all_access("2019_es_wikidaily.csv", "latin1") #spanish
wiki_en = wiki_analysis.Wiki_all_access("2019_en_wikidaily.csv", "utf-16") #english
wiki_de = wiki_analysis.Wiki_all_access("2019_de_wikidaily.csv", "utf-16") #german
wiki_ru = wiki_analysis.Wiki_all_access("2019_ru_wikidaily.csv", "utf-16") #russian

### Most viewed articles in Wikipedia for 2019

I decided to use plotly for this visualization because it makes it possible to create interactive plots. I use buttons: one for the entire year and one for each month. The plot for the entire year looks cluttered, but it makes it possible to compare the highest number of views in a single day. When looking at individual months, it is easier to distinguish specific trends for a subset of articles. Moreover, in the first plot, it is also possible to see trends for articles that were among the top 10 for several months.
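In plotly, buttons of this kind are usually defined through the `updatemenus` entry of the figure layout, where each button toggles the visibility of the traces. The sketch below builds such a list of buttons as plain dictionaries. The helper `build_buttons` and its input format are assumptions for illustration; the actual implementation in `wiki_visual.lang_plot` may be organized differently.

```python
MONTH_LABELS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
                "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def build_buttons(trace_months):
    """Build one button for the whole year and one per month.

    trace_months holds, for each trace in the figure, the set of months
    (1 to 12) in which that article was among the top ten (assumed input).
    """
    # The yearly button shows every trace
    buttons = [dict(label="2019", method="update",
                    args=[{"visible": [True] * len(trace_months)}])]
    for number, label in enumerate(MONTH_LABELS, start=1):
        # A monthly button shows only the traces ranked in that month
        mask = [number in months for months in trace_months]
        buttons.append(dict(label=label, method="update",
                            args=[{"visible": mask}]))
    return buttons


# Three hypothetical traces: top ten in Jan-Feb, in Feb only, and in Dec only
buttons = build_buttons([{1, 2}, {2}, {12}])
print(buttons[2])
```

The resulting list could then be attached to a figure with something like `fig.update_layout(updatemenus=[dict(type="buttons", buttons=buttons)])`.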

#### Most notable trends

- Two categories stand out in these four languages: entertainment and deaths of notable individuals. Many of the top articles are related to recently released movies and TV series. There is also a high number of views for articles about famous people right after their deaths.

In [9]:
wiki_visual.lang_plot(wiki_en.get_df(), 'total', 'English')

- English Wikipedia is dominated by articles related to movies and TV series. However, three of the five most viewed articles in a single day are about celebrities who had recently died: *Cameron Boyce*, *Nipsey Hussle* and *Luke Perry*.


- Articles related to recently released movies usually show a clear trend: they are mostly accessed during weekends. This makes sense, since movie theaters are busiest on weekends, so people are likely reading about the movie shortly after seeing it. This trend is visible for *Freddie Mercury* (the related movie is titled *Bohemian Rhapsody*), *Avengers: Endgame*, *Captain Marvel (film)*, *Shazam! (film)*, *Spider-Man: Far From Home*, *Once Upon a Time in Hollywood* and *Joker (2019 film)*.


- For movies, TV series and documentaries on streaming services, the same trend appears, although with some exceptions.
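The weekend pattern mentioned in the list above can be quantified as the share of an article's views that fall on Saturdays and Sundays. The sketch below is only an illustration: the function name `weekend_share` is hypothetical and the view counts are invented. If views were spread evenly across the week, the share would be 2/7, or about 0.29, so clearly higher values suggest a weekend effect.

```python
from datetime import date


def weekend_share(daily_views):
    """Return the fraction of views that fall on a Saturday or Sunday.

    daily_views maps datetime.date objects to view counts.
    """
    total = sum(daily_views.values())
    if total == 0:
        return 0.0
    # date.weekday() returns 5 for Saturday and 6 for Sunday
    weekend = sum(views for day, views in daily_views.items()
                  if day.weekday() >= 5)
    return weekend / total


# Invented counts for a Friday-to-Monday window
views = {
    date(2019, 4, 26): 100,  # Friday
    date(2019, 4, 27): 300,  # Saturday
    date(2019, 4, 28): 300,  # Sunday
    date(2019, 4, 29): 100,  # Monday
}
print(weekend_share(views))
```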

In [4]:
wiki_visual.lang_plot(wiki_es.get_df(), 'total', 'Spanish')

In [6]:
wiki_visual.lang_plot(wiki_ru.get_df(), 'total', 'Russian')

In [7]:
wiki_visual.lang_plot(wiki_de.get_df(), 'total', 'German')