# 4. Grouping and Making Comparisons


## 1. Research Scenario: Content Type and Popularity

Are posts with pictures more popular than other types of content? To answer this question, we have to compare content types by the number of reactions they receive. The code below shows how to do this.

#### Exercise

Load posts by the comedian [Andy Borowitz](https://www.facebook.com/andyborowitz/). The data is available [here](https://raw.githubusercontent.com/kasparvonbeelen/CTH2019/master/data/page_38423635680_2019_01_13_12_33_40.tab).

In [1]:
%matplotlib inline
import pandas as pd
url = 'https://raw.githubusercontent.com/kasparvonbeelen/CTH2019/master/data/page_38423635680_2019_01_13_12_33_40.tab'
data = pd.read_csv(url,sep='\t')

#### Exercise

Print the column names and the first ten rows. Which column contains information about the post's content type?

In [None]:
# insert code here

Apply the `.unique()` method to this column to see which values the column contains.

In [None]:
# insert code here

The "type" column indicates whether the post contained a picture or some other kind of content. With Pandas we can easily study whether, on average, posts with pictures receive more reactions than other content types. Run the code; I will explain how it works later on.

In [None]:
group_means = data.groupby('type')['reactions_count_fb'].mean()
group_means

A visual representation is often more insightful, so let's make a bar plot to compare the means.

In [None]:
%matplotlib inline
group_means.plot(kind='bar')

Plotting the actual distribution is also possible.

In [None]:
data.groupby('type')['reactions_count_fb'].plot(kind='density',legend=True)

#### About `.groupby()`

Ok, what happened here?

The `.groupby()` method groups the table by the column named within the parentheses (here the content type, e.g. "link", "photo", etc.). Then, for each group, we apply a specific calculation (e.g. computing the average or simply summing all the values). The figure below gives a good graphic representation of this process. The only difference is that we used `.mean()` instead of `.sum()`.
<img src="http://i0.wp.com/datapandas.com/wp-content/uploads/2016/09/pandas-powerful-data-analysis-tools-group-by.jpg?resize=600%2C450">

In other words, `.groupby()` divides the DataFrame by a specific category and then applies a method (such as sum, median or other) to each of these sub-tables before combining the results again. To see how this works, you can inspect the toy example below.

In [None]:
example = pd.DataFrame([['A',1],['B',2],['A',4],['B',1],['A',6],['B',1]],columns=['category','value'])
example

In the code cell below we group the rows by the "category" column and then sum (for each of the sub-tables) the values in the "value" column.

In [None]:
example.groupby('category')['value'].sum()
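
To make the split, apply and combine steps concrete, the sketch below redoes the same sum by hand with a loop and compares it with the `.groupby()` shortcut. The names `manual_sums`, `manual_result` and `groupby_result` are only chosen for this illustration; it is a way to picture the process, not a description of how Pandas works internally.

```python
import pandas as pd

# Same toy table as in the cell above
example = pd.DataFrame(
    [['A', 1], ['B', 2], ['A', 4], ['B', 1], ['A', 6], ['B', 1]],
    columns=['category', 'value'],
)

# Split: build one sub-table per category
# Apply: sum the 'value' column of each sub-table
manual_sums = {}
for category in example['category'].unique():
    sub_table = example[example['category'] == category]
    manual_sums[category] = sub_table['value'].sum()

# Combine: collect the per-group results in a single Series
manual_result = pd.Series(manual_sums)

# The one-line shortcut used in this notebook
groupby_result = example.groupby('category')['value'].sum()

print(manual_result)
print(groupby_result)
```

Both printouts should list the same totals per category, which is why the one-line version is preferred in practice.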

#### Exercise

Group by the "category" column and then compute the mean of 'value'.

In [None]:
# insert code here

#### Exercise

Returning to the Borowitz example: The results suggested that, on average, posts with pictures are more popular. But, as seen in lecture 2, the mean is sensitive to outliers. To check if the results are robust, investigate if the findings change when computing the median by content type.

In [None]:
# insert code here

#### Exercise

Does visual content elicit stronger *emotional* responses? Inspect the angry and love reactions.

In [None]:
# insert code here

#### \*\*\* Exercise

We can use a similar technique to track the most active users on social media platforms.
- Load comments on the YouTube Techno Documentary we studied in the second lecture (you can use [this table](https://raw.githubusercontent.com/kasparvonbeelen/CTH2019/master/data/videoinfo_-OLEyOYC6P4_2018_12_18-09_26_24_comments.tab))
- Group the table by users (use "authorName")
- Then apply `count()` to the "id" column
- Assign the result of this operation to a new variable `df_comment_count`
- Sort the result in descending order by applying the `.sort_values()` method to `df_comment_count`
- In the latter method, set the `ascending` argument to `False` (i.e. `.sort_values(ascending=False)`)


In [None]:
# insert code here

#### \*\*\* Exercise

- Continuing the YouTube Techno example: can you also rank the authors by the likes they have **received** (i.e. find the most popular authors)? Save the result of the `.groupby()` operation in a new variable with the name `df_like_count`

> HINT: first group by authorName and then apply `sum()` to the likeCount

- Plot the distribution of these likes (grouped by author) with a Histogram or a Density plot. These follow a winner-takes-it-all [Power Law](https://en.wikipedia.org/wiki/Power_law)

> HINT: simply apply `.plot(kind='density')` to `df_like_count`


In [None]:
# insert code here

## 2. Research Scenario: Lexicon-Based Sentiment Analysis with VADER

In the preceding examples we relied on reactions to study emotional behaviour on social media. Often this information is not available, and we only have access to the text itself to detect emotion. In this situation, we can use automatic sentiment detection tools such as **VADER**.

[from Github](https://github.com/cjhutto/vaderSentiment): VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool.

VADER uses a [lexicon](https://github.com/cjhutto/vaderSentiment/blob/master/vaderSentiment/vader_lexicon.txt) (a mapping of words to sentiment values, e.g. bad=-1.0, good=+1.0) to compute the sentiment (positivity or negativity) of a text.
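
The idea of a lexicon can be illustrated with a deliberately simplified sketch. The mini-lexicon `toy_lexicon` and the function `toy_sentiment` below are hypothetical and invented for this illustration; the values are made up, and VADER itself adds rules on top of its lexicon that this sketch ignores.

```python
# Hypothetical mini-lexicon; the words and values are made up for illustration
toy_lexicon = {'good': 1.0, 'great': 2.0, 'bad': -1.0, 'awful': -2.0}

def toy_sentiment(text):
    """Sum the lexicon values of all words in the text (unknown words count as 0)."""
    words = text.lower().split()
    return sum(toy_lexicon.get(word, 0.0) for word in words)

print(toy_sentiment('A good film with a great cast'))   # positive total
print(toy_sentiment('An awful script and bad acting'))  # negative total
```

A plain sum like this cannot deal with negations such as "not good", which is one reason why a rule-based tool is used in the rest of this scenario.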

#### Question

Inspect the VADER lexicon: Can you understand the structure of this table? 

Before working with VADER we first have to check if NLTK is properly installed.

In [None]:
# if this cell yields an error run the next one
import nltk

In [None]:
# only run this cell if the preceding one yields an error
!pip install nltk

Before we run VADER, we need to download the lexicon it relies on.

In [None]:
# we need to install the vader lexicon first
import nltk
nltk.download('vader_lexicon')

Now we can load and initialize the VADER Sentiment analyzer

In [None]:
from nltk.sentiment import vader
analyzer = vader.SentimentIntensityAnalyzer()

Below you can test VADER yourself by changing the value of the ``text`` variable, and running the code block. 

Can you trick the system? Not very easy, is it?!

In [None]:
text = "Not interesting."
sentiments_analysis = analyzer.polarity_scores(text)
print(sentiments_analysis)

#### Exercise

Copy-paste the code above, but change the `text` variable and inspect the emotion scores VADER produces (try typing a very positive and a very negative sentence). Also, try to fool VADER by writing complex sentences with negations (such as "not sad").

In [None]:
# insert code here

We are interested here in the compound emotion, a combination of positive and negative sentiment. We can select this specific value by putting the string 'compound' between square brackets:

In [None]:
sentiments_analysis['compound']

We can make things easier by writing a function that does this in one go, i.e. returns the compound score for a given text. Just run the cell below; don't worry about the syntax.

In [None]:
# run this cell to create the function
def compound_sentiment(text):
    sentiments_analysis = analyzer.polarity_scores(str(text))
    return sentiments_analysis['compound']

In [None]:
example_text = "This is sooooo not funny!"
compound_sentiment(example_text)

`compound_sentiment` takes a text and returns the compound sentiment computed by VADER. This technique allows us to study emotional behaviour online, based on the posts users write.  

Once we have all the data in a DataFrame we can easily apply the `compound_sentiment` function to **all** comments. For such an operation--applying a function to all rows in a DataFrame--Pandas offers the `.apply()` method. In the cells below, we will investigate emotion in comments on the New York Times' Facebook page.
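
As a minimal sketch of how `.apply()` works, the cell below uses a small invented table and a simple helper function instead of VADER. The names `toy_comments` and `word_count` are hypothetical and only serve this illustration.

```python
import pandas as pd

# Hypothetical comments, invented for this illustration
toy_comments = pd.DataFrame({'text': ['Great news!', 'I do not agree', 'Thanks']})

def word_count(text):
    """Return the number of whitespace-separated words in a text."""
    return len(str(text).split())

# .apply() calls word_count once for every value in the column
# and returns the results as a new Series of the same length
toy_comments['n_words'] = toy_comments['text'].apply(word_count)
print(toy_comments)
```

Note that the function is passed by name, without parentheses: Pandas takes care of calling it for each row.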

So let's load the data again. By now, you should know how this works.

In [None]:
url = 'https://raw.githubusercontent.com/kasparvonbeelen/CTH2019/master/data_nytimes/page_5281959998_2018_12_28_22_00_39_comments.tab'
# insert code here
df = pd....

Let's first try to compute the sentiment present in the **posts** (instead of the comments). Look attentively at the syntax below.

- `df['post_text']`: selects the column we want to use for sentiment mining
- `.apply(compound_sentiment)`: applies the function between the parentheses to this column

In [None]:
df['post_emotion'] = df['post_text'].apply(compound_sentiment)
df.head()

#### \*\*\*Exercise

Plot a histogram that visualises the distribution of the emotion scores returned by VADER.

In [None]:
# insert code here

What is the average emotion score?

In [None]:
# insert code here

Now apply `compound_sentiment` to the **comments** (above we applied it only to the posts). Save these scores in a new column `comment_emotion`.

In [None]:
# insert code here

Plot the distribution of these emotion scores.

In [None]:
# insert code here

Is the average post more negative than the average comment?

In [None]:
# insert code here

## 3. Research Scenario: Finding the Haters (and Lovers?)

In this scenario, we aim to investigate online harassment by identifying users who comment frequently *and* in a negative way. I selected comments on an interview with [Taylor Swift](https://raw.githubusercontent.com/kasparvonbeelen/CTH2019/master/data/videoinfo_P-TFhUq3otQ_2019_01_15-11_21_20_comments.tab). Run the example, and replicate the scenario using a video of your own choice.

We start with loading the data.

In [None]:
import pandas as pd
url = 'https://raw.githubusercontent.com/kasparvonbeelen/CTH2019/master/data/videoinfo_P-TFhUq3otQ_2019_01_15-11_21_20_comments.tab'
dfcomment=pd.read_csv(url,sep='\t')

The cell below contains a function to identify only the **negative** sentiment present in a post. After running the cell (and loading it into memory) we can apply it to all the comments.

In [None]:
def negative_sentiment(text):
    sentiments_analysis = analyzer.polarity_scores(str(text))
    return sentiments_analysis['neg']

#### Exercise

Change the text variable below to understand how `negative_sentiment` works.

In [None]:
text = 'TYPE YOUR TEXT HERE'
negative_sentiment(text)

In [None]:
dfcomment['negative_emotion'] = dfcomment['text'].apply(negative_sentiment)

If you print the first rows, you'll see that the DataFrame now contains a new column that records the negative emotion present in each comment.

In [None]:
dfcomment.head()

Now, we can count how often a user posted under this video by counting the number of comment ids by user. 

In [None]:
users = dfcomment.groupby('authorName')['id'].count()
users

As we are only interested in the very active users, we sort the Series in descending order.

In [None]:
users_sorted = users.sort_values(ascending=False)
users_sorted

The `.index` attribute contains the actual names of the users, sorted by the number of comments they posted.

In [None]:
users_sorted.index

For this example, we only look at the 10 most active users. We can select their names by slicing the index with square brackets.

In [None]:
ten_most_active = users_sorted.index[:10]
ten_most_active

Once we have these names, we can select the rows where the user name appears in `ten_most_active`. In Pandas, we can create a mask with `column.isin(values)`: the `.isin()` method checks, for each element of `column`, whether it appears in `values`. Here it checks which values in the 'authorName' column appear in the `ten_most_active` Index. The cell below prints the resulting mask, a Series that is `True` for the rows where this condition holds.

In [None]:
dfcomment['authorName'].isin(ten_most_active)

We save the rows created by these most active users in a separate DataFrame. You'll notice that we discard quite a lot of information, as most users only post once.

In [None]:
df_active_users = dfcomment[dfcomment['authorName'].isin(ten_most_active)]
df_active_users.shape
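
The counting, slicing and masking steps above can be followed on a table small enough to check by eye. The sketch below uses an invented table `toy` with made-up author names and keeps only the two most active authors; all variable names are hypothetical.

```python
import pandas as pd

# Hypothetical comment table, invented for this illustration
toy = pd.DataFrame({
    'authorName': ['ann', 'bob', 'ann', 'cat', 'bob', 'ann', 'dan'],
    'id': [1, 2, 3, 4, 5, 6, 7],
})

# Count comments per author, sort, and keep the names of the two most active
counts = toy.groupby('authorName')['id'].count().sort_values(ascending=False)
two_most_active = counts.index[:2]

# Boolean mask: True where the author is one of the two most active
mask = toy['authorName'].isin(two_most_active)
print(mask)

# Passing the mask between square brackets keeps only the True rows
toy_active = toy[mask]
print(toy_active)
```

Five of the seven rows survive the filter, because "ann" and "bob" wrote five comments between them.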

Lastly, we rely on `.groupby()` to compute the mean of the negative emotion scores (by user).

In [None]:
negative_commenters = df_active_users.groupby('authorName')['negative_emotion'].mean()
negative_commenters

Oh, and don't forget to sort the users by their negativity.

In [None]:
negative_commenters.sort_values(ascending=False)

With `.loc` we can inspect what these users actually wrote:

In [None]:
pd.options.display.max_colwidth = None # ignore this, this is just to print more text (older Pandas versions used -1)
print(dfcomment.loc[dfcomment['authorName'] == 'n i c o l e','text'])
pd.options.display.max_colwidth = 50 # ignore this, this restores the default column width

#### \*\*\*Exercise

Replicate this scenario with a YouTube video of your own choice.

In [None]:
# insert code here

## 4. Research Scenario: Content Analysis and Reactions

In this scenario, we study text in relation to context. Do certain topics elicit more negative reactions than others? Here we look at the perception of Trump on Fox News and the New York Times.

But first, we have to revisit the string methods encountered in Lecture 1. There we inspected several string functions, for example `len()` and `.lower()`. 

#### Exercise

Count the number of characters in the sentence below.

In [None]:
sentence = "Jeremy Corbyn has pledged Labour will call a no-confidence motion in Theresa May’s government “soon”, while again indicating that if he became prime minister he would prefer to negotiate his own Brexit deal rather than call a second referendum."
# insert code here

Now, lowercase the sentence and save the lowercased sentence in a new variable `sentence_lower`.

In [None]:
# insert code here

With Pandas, you can easily apply these functions to a whole text column. Look closely at the toy example below before turning to the main assignment.

In [None]:
example_df = pd.DataFrame([[17,'Hello :-)'],[25,'How are you!'],[121,'Yes!'],[10,"Yihaaaa"]],columns=['likes','text'])
example_df

`.str.len()` records the length of each text.

> Notice the use of `.str` between the column and the `len()` function.

In [None]:
example_df['text'].str.len()

We can add this column to the DataFrame.

In [None]:
example_df['text_length'] = example_df['text'].str.len()
example_df

Now we have a new column that records the length of the values in the text column.
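
Lowercasing works the same way through `.str`. The cell below is a minimal sketch on an invented table; the name `toy_posts` and its contents are hypothetical.

```python
import pandas as pd

# Hypothetical toy table, invented for this illustration
toy_posts = pd.DataFrame({'text': ['Hello World', 'TRUMP tweets', 'quiet']})

# .str.lower() lowercases every value in the column and returns a new Series
toy_posts['lowercased'] = toy_posts['text'].str.lower()
print(toy_posts)
```

The original "text" column is left unchanged; the lowercased version is stored next to it.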

#### \*\*\*Exercise 

- Retrieve comments from the New York Times (or use [these data](https://raw.githubusercontent.com/kasparvonbeelen/CTH2019/master/data_nytimes/page_5281959998_2018_12_28_22_00_39_comments.tab) I prepared)
- Compute the length of each comment; add these values as a new column to the DataFrame
- Sort the DataFrame by comment length; print the ten longest comments
- Plot the distribution of the comment lengths using a histogram (set the `bins` argument to 100)

In [None]:
# insert code here

#### \*\*\*Exercise 

Simple question: are posts on the New York Times Facebook page on average longer than those on the Fox News page? Use the techniques you learned in this and the previous lectures to answer this question. You can use these data:
- [Fox News](https://raw.githubusercontent.com/kasparvonbeelen/CTH2019/master/lecture_3_data/foxnews.tab)
- [New York Times](https://raw.githubusercontent.com/kasparvonbeelen/CTH2019/master/lecture_3_data/nytimes.tab)

In [None]:
# insert code here

Besides `len()` and `.lower()`, we can also apply `.find()` to text columns in DataFrames. Revisiting the toy example above (`example_df`), let's make a new column that has the value one for strings that contain the character `a` and zero otherwise.

First we create a column that has value -1 for strings that do not contain the query term (in this case the character 'a'), otherwise, it records the position of the first hit.

In [None]:
example_df['text'].str.find('a')

The code below adds this information as a new column to the `example_df` DataFrame:

In [None]:
example_df['has_a'] = example_df['text'].str.find('a')

#### Changing the values of cells

Below we change the values in the 'has_a' column. If the value is greater than -1 (which means that the text contains the query term), we set it to 1.

> Notice the use of `.loc`. The syntax is similar to selecting a part of a DataFrame using a mask and a column name.

In [None]:
example_df.loc[example_df['has_a'] >= 0,'has_a'] = 1 
example_df

For those rows that do not contain the query term, we set the values to zero.

In [None]:
example_df.loc[example_df['has_a'] < 0,'has_a'] = 0
example_df
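
As an alternative sketch, the same 1/0 column can be built in one step with `.str.contains()`. The table is rebuilt here under the hypothetical name `toy_df` so that `example_df` is left untouched; this is a possible shortcut, not the approach used in the rest of this notebook.

```python
import pandas as pd

# Copy of the toy table from above, under a different (hypothetical) name
toy_df = pd.DataFrame(
    [[17, 'Hello :-)'], [25, 'How are you!'], [121, 'Yes!'], [10, 'Yihaaaa']],
    columns=['likes', 'text'],
)

# .str.contains() returns True/False per row; .astype(int) turns that into 1/0
# regex=False treats 'a' as a literal string rather than a regular expression
toy_df['has_a'] = toy_df['text'].str.contains('a', regex=False).astype(int)
print(toy_df)
```

The `.find()` and `.loc` route takes more steps, but it shows how to overwrite selected cells, which is useful well beyond this example.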

We can use a similar technique to find posts about Trump from the New York Times. 

#### Exercise

Explain, in a few words, what happens at each line in the code cells below. Use # to comment on the code.

In [None]:
# load the data
url = 'https://raw.githubusercontent.com/kasparvonbeelen/CTH2019/master/lecture_3_data/nytimes.tab'
data_nytimes = pd.read_csv(url,sep='\t')
data_nytimes.columns

In [None]:
data_nytimes['about_trump'] = data_nytimes['post_message'].str.find('Trump')
data_nytimes['about_trump']

In [None]:
data_nytimes.loc[data_nytimes['about_trump'] >= 0,'about_trump'] = 1
data_nytimes.loc[data_nytimes['about_trump'] < 0,'about_trump'] = 0

Now "about_trump" is a binary variable: 1 for posts that mention Trump, 0 otherwise.

In [None]:
data_nytimes['about_trump']

Now we can investigate if posts about Trump receive more angry reactions!

In [None]:
trump_angry = data_nytimes.groupby('about_trump')['rea_ANGRY'].mean()
trump_angry

The numbers are convincing, even more so when we use a visualisation.

In [None]:
trump_angry.plot(kind='bar')

#### \*\*\*What about love for Trump on the New York Times?

Repeat the comparison above for the LOVE reactions.

In [None]:
# insert code here

#### \*\*\* Exercise

Use the [Fox News Dataset](https://raw.githubusercontent.com/kasparvonbeelen/CTH2019/master/lecture_3_data/foxnews.tab). Do you find a similar pattern when inspecting LOVE and ANGRY reactions to Trump?

In [None]:
# insert code here

#### \*\*\* Exercise

Let's revisit the posts from Andy Borowitz. He became a famous critic of Donald Trump. This exercise explores whether his followers are more reactive to his comments on the president.

Similar to the example above:
- Create a new column with the name "lowercased" which contains the lowercased text
- Create a column which records if the post contains the string "trump". The column has value 1 if the post text mentions Trump, otherwise it has the value 0.
- Use `.groupby()` to compare the average number of **likes** received by posts that mention Trump and by posts that do not.
- Inspect also HAHA, LOVE and ANGRY reactions.

In [None]:
# insert code here

# That's all for today! Congratulations again!