## Part 2: Intro to Data Visualization

Great job getting all the data together! That was quite a bit of work. Now comes the fun part: diving into the data. This is a great chance to understand the field of Computational Social Science in a data-driven way and to apply the methods we will talk about in class.

Before we start, we're going to talk about Data Visualization. It is something I deeply care about, because visualization is important both for analysing data and for showing our results to others.

In the two videos below, I will: (i) introduce general concepts on Data Visualization and (ii) present a few tips and techniques to improve the visual quality of your plots in Python. 

**In the assignments, I expect your plots to be informative, well-designed and easy to interpret. These aspects will be part of the evaluation.**

> * _Video Lecture_: [Intro to Data Visualization](https://www.youtube.com/watch?v=oLSdlg3PUO0)

In [2]:
from IPython.display import YouTubeVideo
YouTubeVideo("oLSdlg3PUO0", width=800, height=450)

Before we start visualizing some cool data, I want to give a few more practical tips for making good plots in matplotlib. Unless you already feel like a pro at visualization, these should be pretty useful for making your plots look much nicer.
Paying attention to details can make an incredible difference when we present our work to others.

**Note**: there are many Python libraries for making visualizations. I am a huge fan of matplotlib, which is one of the most widely used, so this is what we will use for this class.

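The video covers the specifics. As a rough sketch (not a transcript of the video's tips), the example below shows the kind of details that usually make a matplotlib figure easier to read: an explicit figure size, labelled axes, readable font sizes, a legend, and less visual clutter. The data are synthetic, and every styling value (sizes, colours, line widths) is an assumption to adapt to your own figure.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic curves, for illustration only
x = np.linspace(0, 10, 200)
y_sin = np.sin(x)
y_cos = np.cos(x)

# Set the figure size explicitly instead of relying on the default
fig, ax = plt.subplots(figsize=(6, 4))

# Distinguish the curves by both colour and line style
ax.plot(x, y_sin, color="tab:blue", linewidth=2, label="sin(x)")
ax.plot(x, y_cos, color="tab:orange", linewidth=2, linestyle="--", label="cos(x)")

# Always label the axes, and keep the text large enough to read
ax.set_xlabel("x", fontsize=12)
ax.set_ylabel("value", fontsize=12)
ax.tick_params(labelsize=10)
ax.legend(frameon=False, fontsize=10)

# Remove the top and right borders to reduce clutter
for side in ("top", "right"):
    ax.spines[side].set_visible(False)

fig.tight_layout()
```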
> *Video Lecture*: [How to improve your plots](https://www.youtube.com/watch?v=sdszHGaP_ag)

In [3]:
from IPython.display import YouTubeVideo
YouTubeVideo("sdszHGaP_ag", width=800, height=450)

## Part 3: Visualizing distributions

Relying solely on summary statistics like the mean, median, and standard deviation to understand your dataset can sometimes be misleading. It's very good practice to begin your analysis by visualizing the data distribution. Observing the probability distribution of data points can reveal a wealth of insights.

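To make this concrete, here is a minimal sketch on synthetic data (a log-normal sample, chosen only because it is heavy-tailed; it is not the course dataset). It compares the mean and the median and checks what share of the points lies below the mean.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Synthetic heavy-tailed sample (assumption: log-normal, for illustration only)
sample = rng.lognormal(mean=0.0, sigma=2.0, size=10_000)

mean = sample.mean()
median = np.median(sample)
std = sample.std()

# Share of data points that are smaller than the mean
frac_below_mean = (sample < mean).mean()

print(f"mean   = {mean:.2f}")
print(f"median = {median:.2f}")
print(f"std    = {std:.2f}")
print(f"share of points below the mean = {frac_below_mean:.0%}")
# A few very large values pull the mean far above the median,
# so the mean does not describe a "typical" data point here.
```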
The problem is that real-world datasets often cover a wide range of values, spanning several orders of magnitude. Hence, basic methods of plotting histograms may not effectively represent these datasets. However, there are techniques to address this challenge and enhance visualization.

In the video lecture below, I demonstrate how to plot histograms for datasets with significant heterogeneity. The techniques are shown using two examples: a financial dataset on stock prices and returns, and data on the number of comments posted by Reddit users. But these methods are universally applicable. You can use them to visualize any type of data.

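As a rough sketch of the problem described above (the video shows the actual walkthrough), the example below bins the same synthetic heavy-tailed sample twice: once with equal-width bins from `numpy.linspace` and once with logarithmically spaced bins from `numpy.logspace`. The Pareto sample, the number of bins, and all variable names are assumptions made for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=1)

# Synthetic heavy-tailed sample with values >= 1 (illustration only)
data = rng.pareto(a=1.5, size=50_000) + 1.0

n_bins = 30

# Linear binning: equal-width bins between the minimum and the maximum
linear_edges = np.linspace(data.min(), data.max(), n_bins + 1)
linear_counts, _ = np.histogram(data, bins=linear_edges)

# Logarithmic binning: bin edges equally spaced in log10
log_edges = np.logspace(np.log10(data.min()), np.log10(data.max()), n_bins + 1)
log_counts, _ = np.histogram(data, bins=log_edges)

# With linear bins, almost all points end up in the very first bin
print("share of data in first linear bin:", linear_counts[0] / len(data))
print("share of data in first log bin:   ", log_counts[0] / len(data))

fig, (ax_lin, ax_log) = plt.subplots(1, 2, figsize=(10, 4))

ax_lin.bar(linear_edges[:-1], linear_counts, width=np.diff(linear_edges), align="edge")
ax_lin.set_title("Linear bins, linear axes")
ax_lin.set_xlabel("value")
ax_lin.set_ylabel("count")

ax_log.bar(log_edges[:-1], log_counts, width=np.diff(log_edges), align="edge")
ax_log.set_xscale("log")
ax_log.set_yscale("log")
ax_log.set_title("Logarithmic bins, log-log axes")
ax_log.set_xlabel("value")
ax_log.set_ylabel("count")

fig.tight_layout()
```

In the left panel a single bar dominates and the tail is invisible; in the right panel the counts are spread across many bins, so the shape of the tail becomes readable.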

> *Video Lecture*: [Plotting histograms and distributions](https://www.youtube.com/watch?v=UpwEsguMtY4)

In [4]:
YouTubeVideo("UpwEsguMtY4", width=800, height=450)

> **Exercise 3: Analyzing Paper Citations**  In this exercise, we aim to explore the distribution of citations per author within the field of Computational Social Science. Our objectives are twofold:
> - *Learn to Plot Distributions:* We'll tackle the challenge of visualizing distributions for heterogeneous data, a common scenario in Computational Social Science.
> - *Investigate Author Recognition:* We'll analyze how recognition (measured in citations) varies for Computational Social Scientists from different countries.
>   
> **Dataset:** Use the "Authors dataset" you prepared in Exercise 2.
> 
> **Tasks:**
> 1. **Data Preparation:**
>    - Extract the total number of citations for each author from the dataset and store this information in an array.
> 2. **Plotting the Overall Citation Distribution:**
>    - Use [`numpy.histogram`](https://numpy.org/doc/stable/reference/generated/numpy.histogram.html) to create a histogram of citations per author. Consider the following when plotting your histogram:
>        - **Number of bins:** The default behavior of `numpy.histogram` is to create 10 equally spaced bins. However, you should customize this to suit your data. Experiment with different numbers and sizes of bins to find the most informative visualization. Too few bins may oversimplify your data, while too many can result in a fragmented appearance.
>        - **Linear vs. Logarithmic Binning:** Choose the appropriate binning:
>          - Use *logarithmic binning* for heterogeneous data that has many extreme values (usually in the right tail), creating bins with `numpy.logspace`.
>          - Otherwise, use *linear binning*, creating bins with `numpy.linspace`.
>        - **Normalization:** Where appropriate, you can convert your histogram into a Probability Density Function:
>          - Set the `density=True` argument in `numpy.histogram`. This normalizes the histogram so that its total area (the sum of bin heights times bin widths) equals 1, providing insights into the probability distribution of citations.
>
> 3. **Comparative Histograms by Country:**
>    - Identify the top 5 countries by the number of authors. For each of these countries, plot the distribution of citations per author (as a line plot). Overlay the five curves on the same figure for comparison.
>
> 4. **Binning Decision:**
>    - Discuss whether you chose linear or logarithmic binning for the histograms in tasks 2 and 3 and justify your choice.
>
> 5. **Normalization Decision:**
>    - Explain whether you normalized the histograms and why. Describe in your own words the difference between normalized and non-normalized histograms.
>
> 6. **Analysis of Recognition Distribution:**
>    - Analyze the plotted distributions to comment on how author recognition, as indicated by citation numbers, varies among authors in the whole dataset, as well as across the selected countries. In your answer, include a comment on the following aspects: the range of values that the distributions span along the x and y axes; the presence of extreme values or outliers; differences in trends across countries. 

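The mechanics of tasks 2 and 3 can be sketched as follows. This is only an illustration under stated assumptions, not a reference solution: the "countries" and their citation counts are synthetic, the dictionary `citations_by_country` is a hypothetical stand-in for whatever structure your Authors dataset has, and the number of bins is arbitrary. Only three groups are shown to keep the example short.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=2)

def fake_citations(mean, sigma, size):
    """Draw synthetic, heavy-tailed integer 'citation counts' (illustration only)."""
    values = np.round(rng.lognormal(mean=mean, sigma=sigma, size=size)).astype(int)
    # Zeros cannot be placed on a log axis; here they are dropped,
    # in a real analysis decide explicitly how to treat them.
    return values[values > 0]

# HYPOTHETICAL stand-in for the Authors dataset: country -> citations per author
citations_by_country = {
    "Country A": fake_citations(5.0, 1.8, 3000),
    "Country B": fake_citations(4.5, 2.0, 2000),
    "Country C": fake_citations(5.5, 1.5, 1000),
}

# Shared logarithmic bins from 1 to just above the largest value,
# so that all curves are directly comparable
all_values = np.concatenate(list(citations_by_country.values()))
bin_edges = np.logspace(0, np.log10(all_values.max() + 1), 25)

# Geometric mean of the edges is the midpoint of each bin on a log axis
bin_centers = np.sqrt(bin_edges[:-1] * bin_edges[1:])

fig, ax = plt.subplots(figsize=(6, 4))
densities = {}
for country, citations in citations_by_country.items():
    # density=True divides counts by (total count * bin width)
    density, _ = np.histogram(citations, bins=bin_edges, density=True)
    densities[country] = density
    nonzero = density > 0  # empty bins cannot be drawn on a log y-axis
    ax.plot(bin_centers[nonzero], density[nonzero],
            marker="o", markersize=3, label=country)

ax.set_xscale("log")
ax.set_yscale("log")
ax.set_xlabel("citations per author")
ax.set_ylabel("probability density")
ax.legend(frameon=False)
fig.tight_layout()
```

Because the groups have different sizes, raw counts would mostly reflect how many authors each group has; with `density=True` each curve has unit area, so the shapes can be compared directly.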

This is the end of today's class :) [And here is a little comic to end on a happy note](https://www.reddit.com/media?url=https%3A%2F%2Fi.redd.it%2F8xor77e2nh971.png)