## 1. Why Pandas?
### Working with (semi)structured data

The data types and functions covered in the previous Notebook belong to the Python standard library. This library is the backbone of the Python language, but the operations it offers (and the structures it represents) remain rather basic. Because writing code for more complex applications from scratch would be laborious, Python can be extended with external libraries that offer a wide range of data science tools.

Below, I give the example of opening CSV files with Pandas (the Python data science library we will be using in this course).

### 1.1. CSV files

CSV stands for "Comma Separated Values": a data format in which the cells are separated by commas.

For example, the CSV table below has **columns** A,B,C and **rows** 1 to 3:

```
,A,B,C
1,0,1,1
2,1,0,1
3,1,1,1
```

The CSV above is an example of **structured** data: each element is properly identified by the **header** (columns) and the **index** (rows).

We can represent the CSV file as a Python string. The newlines are represented in Python by the '\n' character. The cell below creates a CSV table and prints it.

In [None]:
csv_string = ',A,B,C\n1,0,2,5\n2,8,3,1\n3,2,9,1'
print(csv_string)

Using index notation (remember the square brackets?) we can select certain elements, both single characters and slices.

In [None]:
print(csv_string[11]) # print character at index 11
print(csv_string[11:14]) # print slice starting at the character with index 11 up to, but not including, index 14

Index notation is handy, but the structure of the CSV file remains implicit, i.e. we cannot ask Python to give us the integer in column B at row 2. To recognize the structure properly, Python needs to **parse** the document. While this can be done with standard Python tools, there are external libraries that help you handle structured data.
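To give an impression of what parsing "by hand" involves, here is a minimal sketch that uses only standard Python. The variable names (`table`, `value_b2`) are made up for this illustration, and the approach assumes a small, well-formed table that contains only integers; it is not how Pandas works internally.

```python
csv_string = ',A,B,C\n1,0,2,5\n2,8,3,1\n3,2,9,1'

# Split the string into lines, then split the first line into column names
lines = csv_string.split('\n')
header = lines[0].split(',')[1:]  # skip the empty cell above the row labels

# Build a dictionary that maps each row label to a {column: value} dictionary
table = {}
for line in lines[1:]:
    cells = line.split(',')
    row_label = int(cells[0])
    table[row_label] = {column: int(value) for column, value in zip(header, cells[1:])}

# Now we can ask for the integer in column B at row 2
value_b2 = table[2]['B']
print(value_b2)
```

Even for this tiny table we need a loop and several splitting steps, which is why a dedicated library is more convenient.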

In this course, we rely on Pandas, Python's main data science library. Before we can use Pandas, we have to load it using the `import` statement (which means as much as "please load all tools hidden in the Pandas folder").

In [None]:
import pandas # this imports pandas tools

After importing the Pandas tools we can use the `read_csv` function to read the data.

In [None]:
from io import StringIO # Ignore this line, this technicality is not important for this course
df = pandas.read_csv(StringIO(csv_string),index_col=0) # transform the CSV string into a Pandas DataFrame
df

### Exercise

What is the data type of the `df` variable?

In [None]:
# Insert code here

### --Important!--

Please note the following:
- The syntax for using tools from external libraries adheres to the following structure:
        `<library>.<function>(<arguments>)`
- As you noticed, the `df` variable is not a string object but belongs to the DataFrame class (which we will inspect below)

Because Pandas identifies the rows and columns automatically, we can ask more meaningful and elaborate questions to extract specific cells or other elements from the DataFrame.

#### Exercise: 

To access specific areas in our DataFrame, we need to use the **`.loc[:,:]`** indexer. The syntax here is similar to the index notation we studied previously, but it applies to two dimensions (rows *and* columns) instead of just one. These dimensions are separated by a comma. While index notation allows you to retrieve an item from a sequence by location, `.loc[]` is more flexible: you can simultaneously define the rows and columns you'd like to access. To understand how this works, guess which part of the table the statements below will print.

#### Retrieving one cell

In [None]:
print(df.loc[2,'A'])

In [None]:
print(df.loc[1,'B'])

#### --Exercise--

Print the integer in row 3 and column C.

In [None]:
# insert code here

Multiply the integer in row 3 and column C with the integer in row 1 and column B.

In [None]:
# insert code here

#### Retrieving slices

In [None]:
print(df.loc[2,:])

In [None]:
print(df.loc[:,'A'])

It is also possible to obtain multiple columns. Note that the column names are then enclosed within square brackets.

In [None]:
print(df.loc[:,['A','B']])

Print the integers from row 1 to 2 in columns A and C.

In [None]:
# insert code here

We can also print all columns for a specific row:

In [None]:
print(df.loc[1,:])
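One difference with index notation is worth a small illustration: `.loc[]` selects by *label*, and a slice of labels includes both its start and its end. The sketch below uses a made-up table (`toy_df`, with hypothetical `likes` and `shares` columns) rather than the course data.

```python
import pandas as pd

# Hypothetical toy table, unrelated to the course data
toy_df = pd.DataFrame(
    {'likes': [10, 25, 3, 7], 'shares': [1, 4, 0, 2]},
    index=[1, 2, 3, 4],
)

# Index notation on a list: the end position is excluded, so we get one item
numbers = [10, 25, 3, 7]
print(numbers[2:3])

# .loc on a DataFrame: the rows labelled 2 and 3 are both included
selection = toy_df.loc[2:3, 'likes']
print(selection)
```

The same `2:3` therefore returns one item from the list but two rows from the DataFrame.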

By now it should be clear that Pandas' `.read_csv()` facilitates the exploration of structured data (compared to strings in combination with index notation that are part of the Python standard library).

Admittedly, this is a toy example; the power of these tools will become clear when working with real data.

## 2. Reading YouTube and Facebook data with Pandas

In this course, we focus on YouTube and Facebook data obtained via [DMI](https://tools.digitalmethods.net) tools.

For YouTube, we use [this collection of tools](https://tools.digitalmethods.net/netvizz/youtube/). For Facebook, we rely on [Netvizz](https://apps.facebook.com/107036545989762/). We inspect both tools later in more detail. 

To demonstrate the Pandas functionalities, we will work with comments from [this](https://www.youtube.com/watch?v=-OLEyOYC6P4) documentary on techno music.

In [None]:
# Run this cell to watch the YouTube clip.
from IPython.display import YouTubeVideo
YouTubeVideo('-OLEyOYC6P4')

All the comments on this video are saved as a TSV (Tab Separated Values) file and stored [here](https://github.com/kasparvonbeelen/CTH2019/blob/master/data/videoinfo_-OLEyOYC6P4_2018_12_18-09_26_24_comments.tab). 

In Python, tab spaces are represented as `\t`. Compare the print statements below to see the difference.

In [None]:
print("Kaspar Beelen")
print("Kaspar\tBeelen")
print("Kaspar\t\tBeelen")

Below we store the URL (that will bring you to the raw data) in the `file_url` variable.

In [None]:
file_url = 'https://raw.githubusercontent.com/kasparvonbeelen/CTH2019/master/data/videoinfo_-OLEyOYC6P4_2018_12_18-09_26_24_comments.tab' 

For convenience, we will refer to the Pandas library by the abbreviation `pd`, which saves us some typing.

The code below imports Pandas using `pd` as an abbreviation.

In [None]:
import pandas as pd

Pandas is smart enough to automatically download information from the Web. We just have to pass it a URL. Secondly, we define the delimiter (a character which separates the individual values) with the `sep=` argument.

In [None]:
df = pd.read_csv(file_url, sep='\t')

With these few lines, you managed to load the whole corpus of comments into your notebook.
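To see what the `sep=` argument does without downloading anything, you can try it on a small tab-separated string. This is only a sketch: the string `tsv_string`, its `author` and `text` columns and the name `toy_df` are invented for the illustration.

```python
from io import StringIO
import pandas as pd

# A made-up table with two columns, separated by tabs (\t).
# Note that the first comment contains a comma.
tsv_string = 'author\ttext\nAnna\tGreat documentary, thanks!\nBen\tLoved the music'

# Because we tell Pandas to split on tabs, the comma stays inside the cell
toy_df = pd.read_csv(StringIO(tsv_string), sep='\t')
print(toy_df)
```

Comments often contain commas, which is one reason why a tab can be a more practical delimiter for this kind of data.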

The `.head()` method allows you to inspect the table.

In [None]:
df.head(3)

**Exercise**: print the first 10 rows

In [None]:
# insert code here

**Question**: What information does `df` contain?

`df.tail()` prints the last rows.

In [None]:
df.tail()

To determine the dimensions of the DataFrame (the number of rows and columns), use the `.shape` attribute.
> IMPORTANT: As opposed to methods, attributes are not followed by parentheses. If methods can be understood as the verbs of the Python language, attributes are the adjectives: they tell you something about the object they are attached to but do not modify it.

In [None]:
df.shape
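The difference between attributes and methods can be tried out on a small table. The sketch below uses an invented DataFrame (`toy_df`) and is only meant as an illustration of the syntax.

```python
import pandas as pd

# Hypothetical toy table with three rows and two columns
toy_df = pd.DataFrame({'A': [0, 1, 1], 'B': [1, 0, 1]})

# An attribute: no parentheses, it simply reports (rows, columns)
n_rows, n_cols = toy_df.shape
print(n_rows, n_cols)

# A method: parentheses (and possibly arguments), it performs an action
first_rows = toy_df.head(2)
print(first_rows)

# Adding parentheses to an attribute raises an error
try:
    toy_df.shape()
except TypeError as error:
    print('Calling an attribute fails:', error)
```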

#### \*\*\*Exercise

Now, let's look at some Facebook data: 50 posts retrieved from the New York Times' Facebook page.

- Load the data you find [here](https://raw.githubusercontent.com/kasparvonbeelen/CTH2019/master/data/page_5281959998_2018_12_28_22_00_39_fullstats.tab) into a Pandas DataFrame. Assign it to a variable with the name `df`.
> HINT: As in the example above, simply use the URL to retrieve the data. 

In [3]:
# insert code here

What are the dimensions of the DataFrame?

In [None]:
# insert code here

Print the column names using the `.columns` attribute.

In [None]:
# insert code here

Show the first ten rows (using `.loc[]`).

In [None]:
# insert code here

And the last ten rows (using `.loc[]`)
> HINT: `.loc[]` does not allow negative indexing, so use .shape to know the dimensions of your DataFrame

In [None]:
# insert code here

Compare the columns with the YouTube example above: What are the differences/similarities?

Print all columns for row 25

In [None]:
# insert code here

Print the post message for rows 25 to 30.

In [None]:
# insert code here

Print the post message, angry and sad reactions columns for rows ten to twenty.

In [None]:
# insert code here

Great! Now you've managed to load your social media data into your Notebook and access specific information. What's next? First, let's explore some of the general statistics.

## Inspecting the DataFrame

`describe()` offers you a quick look into the summary statistics of the dataset.

In [None]:
df.describe()

**Question**: Which **columns** are not shown in this summary? Can you explain why?

**Question**: Can you determine the meaning of the **rows**? If not, have a look at the Wikipedia pages for the [Mean](https://en.wikipedia.org/wiki/Mean), [Median](https://en.wikipedia.org/wiki/Median), [Standard Deviation](https://en.wikipedia.org/wiki/Standard_deviation) and [Percentiles](https://en.wikipedia.org/wiki/Percentile).

For more information about the `.describe()` method, we can access the help function by adding `??` at the end of the line.

In [None]:
df.describe??

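The mean and the median are used to summarise a list of values, for example [1,2,4,5,8].
The median is the number you get when sorting the list in ascending order and picking the number in the middle. If the list contains an even number of values, the median is the average of the two middle numbers.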
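As a sketch, the median can be computed step by step with standard Python for the example list [1,2,4,5,8]. The variable names are made up for this illustration.

```python
values = [1, 2, 4, 5, 8]

# Step 1: sort the values in ascending order
ordered = sorted(values)

# Step 2: find the middle position
n = len(ordered)
middle = n // 2

# Step 3: pick the middle value, or average the two middle values
# when the list has an even number of items
if n % 2 == 1:
    median = ordered[middle]
else:
    median = (ordered[middle - 1] + ordered[middle]) / 2

print(median)
```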


**Question**: compute the median value of 2,4,8,9,0.

The mean is obtained by summing up all values and dividing this sum by the total number of values. 
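For the example list [1,2,4,5,8], this computation can be sketched with two built-in Python functions:

```python
values = [1, 2, 4, 5, 8]

# Sum all values and divide by the number of values
total = sum(values)
count = len(values)
mean = total / count

print(mean)
```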


**Question**: what is the mean for the series  2,4,8,9,0?

Luckily Python's Numpy library provides you with functions to compute the median and the mean. 

#### Exercise

Before running the commands below, we have to load Numpy. Similar to the Pandas example before, load the numpy library using `np` as an abbreviation.
> Hint: Use the import *library* as *abbreviation* syntax

In [None]:
# insert code here

Then, to check your previous answers, run the cells below:

In [None]:
np.median([2,4,8,9,0])

In [None]:
np.mean([2,4,8,9,0])

In the examples above, the mean and the median happen to be very close to each other, but this is not always the case.

**Question**

Run the cells below. Can you explain why the median and mean diverge? Which score provides the best summary for the list of numbers? Which metric (mean or median) is more reliable? Or does it depend on the distribution of the values?

In [None]:
results = [0,0,1,1,2,4,5,6,6,7,8,7]
print('The mean is ',np.mean(results))
print('The median is ',np.median(results))

In [None]:
results = [0,0,1,1,2,4,5,6,6,7,8,70000]
print('The mean is ',np.mean(results))
print('The median is ',np.median(results))

In [None]:
results = [0,0,0,0,0,0,0,6,6,7,8,7]
print('The mean is ',np.mean(results))
print('The median is ',np.median(results))

In [None]:
results = [0,0,0,0,0,0,0,6000,6000,7000,8000,7000]
print('The mean is ',np.mean(results))
print('The median is ',np.median(results))

As these toy examples show, both the median and the mean can give a misleading impression of the actual distribution of values. Oftentimes, it is good practice to visualise an array of numbers. Pandas gives you many tools for visualising data. But before we can do this, we have to instruct Python to plot all figures in the Notebook by running the so-called "magic" command.

In [2]:
%matplotlib inline

Before we can plot data--for example the distribution of the total reactions to New York Times Facebook posts--we have to select the column we want to investigate. This can be done with the following line of code.

In [None]:
df.loc[:,'reactions_count_fb']

Or simpler:

In [None]:
df['reactions_count_fb']
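Both notations return the same column. A small check on an invented table (`toy_df`, with hypothetical `likes` and `shares` columns) illustrates this:

```python
import pandas as pd

# Hypothetical toy table, unrelated to the Facebook data
toy_df = pd.DataFrame({'likes': [10, 25, 3], 'shares': [1, 4, 0]})

with_loc = toy_df.loc[:, 'likes']   # all rows, column 'likes'
with_brackets = toy_df['likes']     # shorthand for the same column

# .equals() compares the two selections value by value
print(with_loc.equals(with_brackets))
```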

We can easily compute the mean and median for columns by applying the `.mean()` and `.median()` methods. Inspect the example below:

In [None]:
df['reactions_count_fb'].mean()

#### Exercise

- What is the median value for the `reactions_count_fb` column?
- What is the mean of the angry reactions column?
- What is the median haha reactions?
> Try to print these scores nicely, as in: print('the average number of reactions is', ...)

In [None]:
# insert code here

After obtaining these summary statistics, we proceed with visualising the actual distribution of the data. We can do this by plotting a histogram, which shows how frequently values (that fall within a certain range) occur:

In [None]:
df['reactions_count_fb'].plot(kind='hist')

**Question** 

How would you interpret this figure?

We can refine the bars on the figure by adjusting the `bins` argument:

In [None]:
df['reactions_count_fb'].plot(kind='hist',bins=100)
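To get a feeling for what the `bins` argument changes, the sketch below counts values per bin with Numpy's `histogram` function instead of drawing them. The list `toy_values` is made up for this illustration.

```python
import numpy as np

# A made-up list of values with one large value
toy_values = [1, 2, 2, 3, 9]

# With 2 bins, the range from 1 to 9 is cut into two equally wide intervals
counts_2, edges_2 = np.histogram(toy_values, bins=2)
print(counts_2, edges_2)

# With 4 bins, the same range is cut into four narrower intervals
counts_4, edges_4 = np.histogram(toy_values, bins=4)
print(counts_4, edges_4)
```

Each count corresponds to the height of one bar in the plotted histogram: more bins give narrower bars and a more detailed, but sometimes noisier, picture.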

#### Exercise

- Plot the distributions of the like counts.
- Adjust the number of bins: use 10,100,1000.
- Which value would you prefer?

In [None]:
# insert code here

## Sorting DataFrames

Lastly, we can sort the DataFrame by a specific column. The `.sort_values()` method takes two arguments:
- `by=`: the column you want to sort by
- `ascending=`: should be equal to `False` if you want to rank the table from high to low.

In [None]:
df.sort_values(by='rea_HAHA',ascending=False)

You can use this in combination with `.head()` if you want to show only the first *n* rows.

In [None]:
df.sort_values(by='rea_HAHA',ascending=False).head(10)
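The same pattern can be tried on a small, invented table. In the sketch below, `toy_df` and its `post` and `likes` columns are hypothetical and only serve to show how sorting and `.head()` combine.

```python
import pandas as pd

# Hypothetical toy table, unrelated to the Facebook data
toy_df = pd.DataFrame({'post': ['a', 'b', 'c', 'd'], 'likes': [10, 250, 3, 75]})

# Rank from high to low on 'likes' and keep the two highest rows
top_two = toy_df.sort_values(by='likes', ascending=False).head(2)
print(top_two)

# The original table is left unchanged: sort_values returns a new DataFrame
print(toy_df)
```

Note that the sorted rows keep their original index labels, so you can still trace them back to their position in the unsorted table.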