# Analyzing the News On Social Media Platforms



## Introduction:

In this analysis, I plan to examine which news outlets, topics, and articles are most popular on social media platforms. In addition to seeing what types of news these platforms spread, I will examine how the sentiment scores of article titles and headlines correlate with popularity, and how that differs by platform. This analysis will help uncover some of the underlying motivations of social media platforms. It will also show whether sentiment is related to the popularity of an article and, if it is, whether positive or negative sentiment has the stronger correlation with popularity. From a human-centered perspective, this will help inform people about the tendencies of the social media platforms they use and help them understand how their psychology gravitates toward certain types of news. I hope to understand how people interact with news on social media platforms on a daily basis in order to gain a better understanding of how human psychology interacts with news on social media.

## Research Questions:

*Quantitative Questions*

**1) Does the sentiment of an article title correlate to its popularity?**

Hypothesis: I believe the more negative the sentiment of a title is, the more popular the article will be.

**2) Does the sentiment of an article headline correlate to its popularity?**

Hypothesis: I believe the more negative the sentiment of a headline is, the more popular the article will be.

**3) Is there a correlation between the sentiments of article headlines and the article titles?**

Hypothesis: I believe there will be a correlation between the sentiments of the headlines and titles. <br/><br/>

*Qualitative Questions*

**4) What article topics are most popular on social media platforms?**

Hypothesis: I believe social media platforms will popularize articles with more polarizing topics over others (Microsoft and Economy will be the least popular).

**5) Do social media platforms popularize articles with certain sentiments over others?**

Hypothesis: I believe social media platforms will popularize articles with negative sentiments over others.

**6) What sources are most popular on social media platforms?**

Hypothesis: I believe social media platforms will popularize articles with more notable sources over others.


## Data Selected for Analysis:

The dataset [News Popularity in Multiple Social Media Platforms](https://archive.ics.uci.edu/ml/datasets/News+Popularity+in+Multiple+Social+Media+Platforms) represents collected data of about 100,000 different news items on four different topics: Economy, Microsoft, Obama and Palestine over the span of 8 months, between November 2015 and July 2016. To download the raw dataset via the hyperlink above, click "Data Folder", "Data/", "News_Final.csv". 

The dataset has a unique identifier for news items, the title of the news item according to the official media sources, the headline of the news item according to the official media sources, the original news outlet that published the news item, the query topic used to obtain the items in the official media sources, the date and time of the news items' publication, the final value of the news items' popularity according to the social media sources (Facebook, Google+, LinkedIn), the sentiment score of the title, and the sentiment score of the text in the news items' headline. The process to obtain these sentiment scores was carried out by applying the framework of the `qdap` [R package](https://www.rdocumentation.org/packages/qdap/versions/2.4.3) with default parametrization. Additional information about the dataset can be found [here](https://arxiv.org/pdf/1801.07055.pdf).

This dataset is suitable for addressing my research goal because it provides a vast array of news articles, their popularity on three different social media platforms, and a sentiment analysis of each. I believe that the narrowed scope of four news topics will allow me to understand the tendencies of these platforms without having to categorize each article into a topic as I move through the data. Additionally, I believe Facebook, Google+ and LinkedIn are different enough from each other to give an understanding of the types of articles that are popular in the different ecosystems (social, casual, professional) of social media platforms. Two things need to be noted when looking at this dataset: the popularity scores of the news items are given by the social media sources themselves, and the sentiment scores are produced by an algorithm whose details are not examined in this notebook. This could lead to biased results that favor the social media companies and/or don't accurately reflect the sentiments of the news items.

In some cases, the popularity value of a news item at a given moment (i.e. timeslice) could not be acquired. These cases are denoted with the value −1 and are mostly associated with scenarios where the item was suggested in official media sources more than two days after its publication in the original news outlet. Such situations represent 12.4% of cases concerning the final popularity of items according to the social media source Facebook, and 6.2% of cases for both Google+ and LinkedIn sources.

The license for this dataset is a Creative Commons Attribution 4.0 International license (CC BY 4.0), which allows for the sharing and adaptation of the datasets for any purpose, provided that the appropriate credit is given.


## Related Works:

Previous research has sought to uncover some of the factors related to news popularity on social media platforms. One study titled [The Pulse of News in Social Media: Forecasting Popularity](https://www.hpl.hp.com/research/scl/papers/newsprediction/pulse.pdf) found that the most important predictor of the popularity of an article was its source. Additionally, another study titled [Are You What You Tweet? The Impact of Sentiment on Digital News Consumption and Social Media Sharing](https://pubsonline.informs.org/doi/10.1287/isre.2022.1112) found that an increase in the sentiment of content increased how much the article was shared on social media, but decreased how many people actually opened the article.

Both of these studies helped inform the way I approached this study. Specifically, the first study guided me toward asking whether certain news sources were more popular than others and, if so, whether this was consistent across social media platforms. The second study made me wonder how much variation there was between the sentiment of the title and the sentiment of the headline for each article. That study gathered its data specifically from Twitter, so I wanted to see whether its findings were consistent across other social media platforms.

## Methodology:

In order to find the answers to my research questions, I will use different methods within Python depending on the variables I am using.

For Questions 1, 2, and 3 I will perform **quantitative analysis** by conducting a **Spearman Rank Correlation** in Python. I chose this method because it is non-parametric: it measures the monotonic relationship between the ranks of two variables (here, the sentiment scores and the popularity scores) without assuming that either one is normally distributed. I will import the CSV dataset as a DataFrame using `Pandas` and import `spearmanr` from `scipy.stats` in order to analyze the data. I will present my results by showing my model output and then describing the meaning behind the specific coefficients produced.

The Spearman Rank function will return a rho-value and a p-value. The rho-value describes the relationship between the two variables being analyzed and ranges from -1 to +1. A rho-value of -1 denotes a perfect negative relationship between two variables, a rho-value of 0 denotes no relationship between two variables, and a rho-value of +1 denotes a perfect positive relationship between two variables. The p-value, on the other hand, tells us whether the rho-value is statistically significant. If the p-value of the correlation is less than 0.05, then the correlation is considered to be statistically significant. Anything above this threshold is considered not statistically significant.
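To make the two extremes of the rho-value concrete, here is a minimal sketch using made-up numbers (the lists below are hypothetical and are not taken from the dataset). It shows that `spearmanr` only looks at the ordering of the values, so even a strongly non-linear but always-increasing relationship gets a rho-value of +1.

```python
from scipy.stats import spearmanr

# Hypothetical toy data, not taken from News_Final.csv
sentiment = [-0.30, -0.10, 0.00, 0.15, 0.40]
shares_falling = [900, 400, 250, 100, 20]   # always decreases as sentiment rises
shares_rising = [1, 10, 100, 1000, 10000]   # always increases (non-linearly)

# spearmanr returns the rho-value and the p-value as a pair
rho_neg, p_neg = spearmanr(sentiment, shares_falling)
rho_pos, p_pos = spearmanr(sentiment, shares_rising)

print("Falling example: rho = " + str(rho_neg))
print("Rising example:  rho = " + str(rho_pos))
```

The first pair is a perfect negative relationship and the second a perfect positive one. Real data such as the popularity scores will fall somewhere in between.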

For Questions 4, 5 and 6, I will perform **qualitative analysis** on the most popular articles for each social media platform. I chose qualitative analysis for these questions because it will allow me to order the data in descending order of popularity and then isolate the desired variable (topics, sentiments and sources). Using Python, I will look at the 10 most popular articles for each social media platform and determine whether any patterns emerge from the data. I plan on presenting the results of my qualitative analysis as an individual table for each social media platform and each variable being looked at (3 tables for each question). Below each table I will describe any patterns that emerge. For example, Question 4 will present a table containing the popularity, the topic, and the social media platform being looked at, followed by a descriptive analysis. After my description will be the second table for Question 4, which will have the same variables for a different social media platform. This table will then be followed by its own descriptive analysis, and so forth.



## Findings:

### Question 1: Does the sentiment of an article title correlate to its popularity?

To answer this question, I had to analyze each social media platform (Facebook, Google+, LinkedIn) separately.

#### Q1) Facebook
First I imported the `Pandas` package as `pd` in order to load and manipulate the dataset. Additionally, I imported the Spearman Rank correlation function, `spearmanr`, from the `scipy.stats` package in order to compute the correlations.

In [3]:
#Import Pandas package
import pandas as pd

#Import the Spearman Rank correlation from the SciPy Stats Package
from scipy.stats import spearmanr

Next, I read the dataset *News_Final.csv* as a DataFrame using `pd.read_csv()` and saved it as `df`. As stated previously, there are some cases in which the popularity value of a news item could not be acquired. To prevent these cases from corrupting the calculations, I removed every row from `df` where the popularity value is -1 (meaning it could not be acquired) by using `df.loc[df["Facebook"] != -1]`. I specified the popularity values of Facebook because that is the social media platform I look at first.

In [4]:
#Read raw dataset as DataFrame
df = pd.read_csv("News_Final.csv")

#Remove all cases in which the popularity value of Facebook is -1 (meaning the popularity value was not acquirable)
df = df.loc[df["Facebook"] != -1]

With the cleaned-up DataFrame, I found the correlation between the sentiment of article titles and the Facebook popularity values by conducting a Spearman Rank Correlation, `spearmanr(df['SentimentTitle'], df['Facebook'])`, which gave me the rho-value, `rho`, and p-value, `p`, which I then printed.

In [5]:
#Calculate Spearman Rank correlation and corresponding p-value
rho, p = spearmanr(df['SentimentTitle'], df['Facebook'])

#Print Spearman Rank Correlation and p-value
print("The rho-value is " + str(rho))
print("The p-value is " + str(p) + "\n")

The rho-value is -0.02294281802183881
The p-value is 5.5259904327745316e-11



The rho-value of -0.0229 indicates that **there is a very small negative correlation between the sentiment of a title and its popularity on Facebook, meaning articles with more negative titles are correlated to being more popular on Facebook**. And because the p-value is well below the threshold of 0.05, this correlation is statistically significant.

Although the rho-value is close to 0 (which would denote no correlation between the two variables), a very weak correlation can still have a significant p-value when the sample size is large. Here, the significant p-value suggests that the weak correlation is unlikely to be due to chance alone and is instead a statistically 'real' feature of the population. This means that although the correlation is small, it exists and is worth noting.
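The effect of sample size on the p-value can be illustrated with simulated data. The sketch below is not part of the original analysis: the effect size of 0.03, the sample sizes, the random seed and the helper name `weak_relationship` are all assumptions chosen only for illustration.

```python
import numpy as np
from scipy.stats import spearmanr

# Fixed seed so the simulation is repeatable
rng = np.random.default_rng(0)

def weak_relationship(n):
    """Simulate n pairs in which y depends only very slightly on x."""
    x = rng.normal(size=n)
    y = 0.03 * x + rng.normal(size=n)  # hypothetical, very weak effect
    return spearmanr(x, y)

# Same underlying relationship, two very different sample sizes
rho_small, p_small = weak_relationship(200)
rho_large, p_large = weak_relationship(100_000)

print("n = 200:     rho = " + str(rho_small) + ", p = " + str(p_small))
print("n = 100,000: rho = " + str(rho_large) + ", p = " + str(p_large))
```

With around 100,000 pairs, which is roughly the size of this dataset, even a rho-value this close to 0 comes with a very small p-value. The p-value says the correlation is unlikely to be zero; it does not say the correlation is large.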

#### Q1) Google+

In order to avoid removing cases where the popularity value of Google+ exists but is -1 for Facebook, I had to re-read the *News_Final.csv* in order to reset the DataFrame. After this, I removed the cases where the popularity values for Google+ are -1. Then I performed the same steps as above to conduct the Spearman Rank correlation and print out the rho-value and p-value, except this time I analyzed `df['GooglePlus']` instead of looking at the values of Facebook.

In [6]:
#Read raw dataset as DataFrame
df = pd.read_csv("News_Final.csv")

#Remove all cases in which the popularity value of Google+ is -1 (meaning the popularity value was not acquirable)
df = df.loc[df["GooglePlus"] != -1]

#Calculate Spearman Rank correlation and corresponding p-value
rho, p = spearmanr(df['SentimentTitle'], df['GooglePlus'])

#Print Spearman rank correlation and p-value
print("The rho-value is " + str(rho))
print("The p-value is " + str(p) + "\n")

The rho-value is -0.016098803472051015
The p-value is 1.9149871400409848e-06



The rho-value of -0.016 indicates that **there is a very small negative correlation between the sentiment of a title and its popularity on Google+, meaning articles with more negative titles are correlated to being more popular on Google+**. And because the p-value is well below the threshold of 0.05, this correlation is statistically significant.

Again, although the rho-value is close to 0 (which would denote no correlation between the two variables), a very weak correlation can still have a significant p-value when the sample size is large. Here, the significant p-value suggests that the weak correlation is unlikely to be due to chance alone and is instead a statistically 'real' feature of the population. This means that although the correlation is small, it exists and is worth noting.

#### Q1) LinkedIn

In order to avoid removing cases where the popularity value of LinkedIn exists but is -1 for Google+, I had to re-read the *News_Final.csv* in order to reset the DataFrame. After this, I removed the cases where the popularity values for LinkedIn are -1. Then I performed the same steps as above to conduct the Spearman Rank correlation and print out the rho-value and p-value, except this time I analyzed `df['LinkedIn']` instead of looking at the values of Facebook.

In [7]:
#Read raw dataset as DataFrame
df = pd.read_csv("News_Final.csv")

#Remove all cases in which the popularity value of LinkedIn is -1 (meaning the popularity value was not acquirable)
df = df.loc[df["LinkedIn"] != -1]

#Calculate Spearman Rank correlation and corresponding p-value
rho, p = spearmanr(df['SentimentTitle'], df['LinkedIn'])

#Print Spearman rank correlation and p-value
print("The rho-value is " + str(rho))
print("The p-value is " + str(p) + "\n")

The rho-value is -0.005139980438811487
The p-value is 0.1284197911979036



The rho-value of -0.0051 indicates that there is a very small negative correlation between the sentiment of a title and its popularity on LinkedIn, meaning articles with more negative titles are correlated to being more popular on LinkedIn. But unlike the previous two correlations the p-value of 0.128 is well above the threshold of 0.05, meaning that **this correlation is not statistically significant**.
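The three per-platform runs above repeat the same read, filter and correlate steps. One possible way to fold them into a single helper is sketched below. The function name `sentiment_popularity_corr` and the small `toy` DataFrame are hypothetical stand-ins (the made-up values only reuse the dataset's column names), so the numbers it prints are not results from *News_Final.csv*.

```python
import pandas as pd
from scipy.stats import spearmanr

def sentiment_popularity_corr(df, sentiment_col, platform_col):
    """Spearman correlation for one platform, skipping rows where popularity is -1."""
    # Filtering creates a new DataFrame, so df itself is left untouched
    valid = df[df[platform_col] != -1]
    rho, p = spearmanr(valid[sentiment_col], valid[platform_col])
    return rho, p, len(valid)

# Tiny stand-in for News_Final.csv with made-up values
toy = pd.DataFrame({
    "SentimentTitle": [-0.2, -0.1, 0.0, 0.1, 0.2, 0.3],
    "Facebook": [50, -1, 30, 20, 10, 5],
    "GooglePlus": [5, 4, -1, 2, 1, -1],
    "LinkedIn": [-1, 1, 2, 3, 4, 5],
})

results = {}
for platform in ["Facebook", "GooglePlus", "LinkedIn"]:
    results[platform] = sentiment_popularity_corr(toy, "SentimentTitle", platform)
    rho, p, n = results[platform]
    print(platform + ": rho = " + str(rho) + ", p = " + str(p) + ", rows used = " + str(n))
```

Because the filter is applied to a copy inside the function, the original DataFrame keeps every row, and the CSV would only need to be read once.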

### Question 2: Does the sentiment of an article headline correlate to its popularity?

To answer this question, I had to analyze each social media platform (Facebook, Google+, LinkedIn) separately.

#### Q2) Facebook

Completing the steps for this was very similar to Question 1. I first had to read the dataset *News_Final.csv* as a dataframe using `pd.read_csv()` and save it as `df`. Then I removed every instance from `df` where the popularity value is -1 (meaning it could not be acquired) by using `df.loc[df["Facebook"] != -1]`. I specified the popularity values of Facebook because that is the social media platform I will look at first. 

In [8]:
#Read raw dataset as DataFrame
df = pd.read_csv("News_Final.csv")

#Remove all cases in which the popularity value of Facebook is -1 (meaning the popularity value was not acquirable)
df = df.loc[df["Facebook"] != -1]

With the cleaned-up DataFrame, I found the correlation between the sentiment of article headlines and the Facebook popularity values by conducting a Spearman Rank Correlation, `spearmanr(df['SentimentHeadline'], df['Facebook'])`, which gave me the rho-value, `rho`, and p-value, `p`, which I then printed.

In [9]:
#Calculate Spearman Rank correlation and corresponding p-value
rho, p = spearmanr(df['SentimentHeadline'], df['Facebook'])

#Print Spearman rank correlation and p-value
print("The rho-value is " + str(rho))
print("The p-value is " + str(p) + "\n")

The rho-value is 0.012887608496410779
The p-value is 0.0002310771416921479



The rho-value of 0.0129 indicates that **there is a very small positive correlation between the sentiment of a headline and its popularity on Facebook, meaning articles with more positive headlines are correlated to being more popular on Facebook**. And because the p-value is well below the threshold of 0.05, this correlation is statistically significant.

Although the rho-value is close to 0 (which would denote no correlation between the two variables), a very weak correlation can still have a significant p-value when the sample size is large. Here, the significant p-value suggests that the weak correlation is unlikely to be due to chance alone and is instead a statistically 'real' feature of the population. This means that although the correlation is small, it exists and is worth noting.

#### Q2) Google+

In order to avoid removing cases where the popularity value of Google+ exists but is -1 for Facebook, I had to re-read the *News_Final.csv* in order to reset the DataFrame. After this, I removed the cases where the popularity values for Google+ are -1. Then I performed the same steps as above to conduct the Spearman Rank correlation and print out the rho-value and p-value, except this time I analyzed `df['GooglePlus']` instead of looking at the values of Facebook.

In [10]:
#Read raw dataset as DataFrame
df = pd.read_csv("News_Final.csv")

#Remove all cases in which the popularity value of Google+ is -1 (meaning the popularity value was not acquirable)
df = df.loc[df["GooglePlus"] != -1]


#Calculate Spearman Rank correlation and corresponding p-value
rho, p = spearmanr(df['SentimentHeadline'], df['GooglePlus'])

#Print Spearman rank correlation and p-value
print("The rho-value is " + str(rho))
print("The p-value is " + str(p) + "\n")

The rho-value is 0.008825280344565827
The p-value is 0.00904120656682812



The rho-value of 0.0088 indicates that **there is a very small positive correlation between the sentiment of a headline and its popularity on Google+, meaning articles with more positive headlines are correlated to being more popular on Google+**. And because the p-value of 0.009 is below the threshold of 0.05, this correlation is statistically significant.

Again, although the rho-value is close to 0 (which would denote no correlation between the two variables), a very weak correlation can still have a significant p-value when the sample size is large. Here, the significant p-value suggests that the weak correlation is unlikely to be due to chance alone and is instead a statistically 'real' feature of the population. This means that although the correlation is small, it exists and is worth noting.

#### Q2) LinkedIn

In order to avoid removing cases where the popularity value of LinkedIn exists but is -1 for Google+, I had to re-read the *News_Final.csv* in order to reset the DataFrame. After this, I removed the cases where the popularity values for LinkedIn are -1. Then I performed the same steps as above to conduct the Spearman Rank correlation and print out the rho-value and p-value, except this time I analyzed `df['LinkedIn']` instead of looking at the values of Facebook.

In [11]:
#Read raw dataset as DataFrame
df = pd.read_csv("News_Final.csv")

#Remove all cases in which the popularity value of LinkedIn is -1 (meaning the popularity value was not acquirable)
df = df.loc[df["LinkedIn"] != -1]

#Calculate Spearman Rank correlation and corresponding p-value
rho, p = spearmanr(df['SentimentHeadline'], df['LinkedIn'])

#Print Spearman rank correlation and p-value
print("The rho-value is " + str(rho))
print("The p-value is " + str(p) + "\n")

The rho-value is 0.014376066350593405
The p-value is 2.1138687562768872e-05



The rho-value of 0.014 indicates that **there is a very small positive correlation between the sentiment of a headline and its popularity on LinkedIn, meaning articles with more positive headlines are correlated to being more popular on LinkedIn**. And because the p-value is well below the threshold of 0.05, this correlation is statistically significant.

Again, although the rho-value is close to 0 (which would denote no correlation between the two variables), a very weak correlation can still have a significant p-value when the sample size is large. Here, the significant p-value suggests that the weak correlation is unlikely to be due to chance alone and is instead a statistically 'real' feature of the population. This means that although the correlation is small, it exists and is worth noting.

### Question 3: Is there a correlation between the sentiments of article headlines and the article titles?

The steps for this question were very similar to Questions 1 and 2. I first had to read the dataset *News_Final.csv* as a DataFrame using `pd.read_csv()` and save it as `df`. But unlike the previous two questions, I didn't have to remove any cases where the popularity value is -1 because I was only looking at the correlation between the sentiment of the article titles and the sentiment of the article headlines.

In [12]:
#Read raw dataset as DataFrame
df = pd.read_csv("News_Final.csv")

After creating the DataFrame `df`, I conducted a Spearman Rank correlation, `spearmanr(df['SentimentTitle'], df['SentimentHeadline'])`, similar to the previous two questions, and then printed the rho-value, `rho`, and p-value, `p`.

In [13]:
#Calculate Spearman Rank correlation and corresponding p-value
rho, p = spearmanr(df['SentimentTitle'], df['SentimentHeadline'])

#Print Spearman rank correlation and p-value
print("The rho-value is " + str(rho))
print("The p-value is " + str(p) + "\n")

The rho-value is 0.17748034416078898
The p-value is 0.0



The rho-value of 0.177 indicates that **there is a weak positive correlation between the sentiment of an article title and the sentiment of its headline, meaning articles with more positive titles are correlated with having more positive headlines**. And because the p-value is well below the threshold of 0.05, this correlation is statistically significant. The printed p-value of 0.0 means the value is too small to be represented as a floating-point number, not that it is exactly zero.

Although this rho-value is noticeably larger than those in Questions 1 and 2, it is still much closer to 0 (no correlation) than to +1 (a perfect positive correlation). With a large sample, even a weak correlation can have a significant p-value, which suggests the correlation is unlikely to be due to chance alone. This means that although the correlation is weak, it exists and is worth noting.

### Question 4: What article topics are most popular on social media platforms?
To answer this question, I had to analyze each social media platform (Facebook, Google+, LinkedIn) separately.

*Note: In order for the following code to function properly, you must first install the tabulate module from the terminal using* `pip install tabulate`

#### Q4) Facebook

In order to answer this question, I first had to read in the raw dataset *News_Final.csv* using `pd.read_csv()` as a DataFrame, `df`.

In [33]:
#Read raw dataset as DataFrame
df = pd.read_csv("News_Final.csv")

Next I sorted the DataFrame in descending order according to the Facebook popularity scores using `df.sort_values("Facebook", ascending=False)` and cut the DataFrame to the 10 most popular articles using `.head(10)`.

In [34]:
df = df.sort_values("Facebook", ascending=False).head(10)

After this I cut the DataFrame to only contain the columns `Topic`, `Title` and the `Facebook` popularity score using `df = df[["Topic", "Title", "Facebook"]]`. I decided to include the article title to give more context about each topic. Finally, I printed the DataFrame in a manner that is easily readable using `print(df.to_markdown(index=False))`.

In [35]:
df = df[["Topic", "Title", "Facebook"]]

print(df.to_markdown(index=False)) 

| Topic     | Title                                                                         |   Facebook |
|:----------|:------------------------------------------------------------------------------|-----------:|
| economy   | Editorial: Welcome rain clouds issues for economy                             |      49211 |
| obama     | Fact Check: Top 10 Lies in Obama's State of the Union                         |      40836 |
| obama     | I Miss Barack Obama                                                           |      32385 |
| obama     | Obama's legacy is at stake                                                    |      30489 |
| economy   | For the Wealthiest, a Private Tax System That Saves Them Billions             |      29564 |
| obama     | How the inner Obama fights ISIS                                               |      24594 |
| obama     | Paul Ryan Betrays America: $1.1 Trillion, 2000-Plus Page Omnibus ...          |      22518 |
| microsoft | Microsoft's 'teen girl'

Before analyzing this table, there are some key takeaways that need to be looked at. First, there is a large discrepancy between the most popular article and the tenth most popular article in the table. Second, **Obama is the most popular topic as it accounts for 5 of the top 10 most popular articles on Facebook**. Additionally, the top 4 article titles could be considered controversial as they are inherently political (and all of them except for the "Fact Check" article are argumentative). **Although not all of these topics are controversial, the topics combined with their titles could be seen in this light.**

#### Q4) Google+

To answer this question for Google+, I repeated everything I did above for Facebook except this time I sorted the 10 most popular values based on the popularity scores of Google+ using `df.sort_values("GooglePlus", ascending=False).head(10)`. Then I cut the DataFrame to only contain the `Topic`, `Title` and the `GooglePlus` popularity score using `df = df[["Topic", "Title", "GooglePlus"]]`. Then I printed it out with the same line of code done previously.

In [39]:
df = pd.read_csv("News_Final.csv")

df = df.sort_values("GooglePlus", ascending=False).head(10)

df = df[["Topic", "Title", "GooglePlus"]]

print(df.to_markdown(index=False)) 

| Topic     | Title                                                                         |   GooglePlus |
|:----------|:------------------------------------------------------------------------------|-------------:|
| economy   | Under Sanders, income and jobs would soar, economist says                     |         1267 |
| microsoft | Microsoft is adding the Linux command line to Windows 10                      |         1016 |
| microsoft | Learning the Alphabet                                                         |         1001 |
| microsoft | Microsoft's 'teen girl' AI turns into a Hitler-loving sex robot within 24 ... |          973 |
| economy   | Intervention by PM at G20 working session on Inclusive Growth ...             |          804 |
| microsoft | Microsoft and Canonical partner to bring Ubuntu to Windows 10                                                                               |          781 |
| economy   | For the Wealthiest, a Private Tax System That Sav

The first thing that stands out about this table is that articles are shared far more on Facebook than on Google+ (the most popular Facebook article was shared 49211 times versus 1267 on Google+). The second thing that stands out is that **Microsoft is the most popular article topic on Google+**. And other than the two articles with Obama as their topic, **the majority of the articles on Google+ aren't controversial**.

#### Q4) LinkedIn

To answer this question for LinkedIn, I repeated everything I did above for the other two platforms except this time I sorted the 10 most popular values based on the popularity scores of LinkedIn using `df.sort_values("LinkedIn", ascending=False).head(10)`. Then I cut the DataFrame to only contain the `Topic`, `Title` and the `LinkedIn` popularity score using `df = df[["Topic", "Title", "LinkedIn"]]`. Then I printed it out with the same line of code done previously.

In [40]:
df = pd.read_csv("News_Final.csv")

df = df.sort_values("LinkedIn", ascending=False).head(10)

df = df[["Topic", "Title", "LinkedIn"]]

print(df.to_markdown(index=False)) 

| Topic     | Title                                                                                   |   LinkedIn |
|:----------|:----------------------------------------------------------------------------------------|-----------:|
| microsoft | Microsoft to buy LinkedIn for $26.2B in cash, makes big move into ...                   |      20341 |
| microsoft | Microsoft to buy LinkedIn for $26B in cash, makes big move into enterprise social media |      19737 |
| microsoft | Microsoft and LinkedIn: Together Changing the Way the World Works                       |      18004 |
| microsoft | LinkedIn CEO: Here's Why I Sold the Company to Microsoft                                |      10465 |
| microsoft | Microsoft to buy LinkedIn for $26.2 billion; LNKD shares jump 48 pct                    |       9237 |
| microsoft | Microsoft to acquire LinkedIn for $26.2 billion                                         |       8115 |
| microsoft | Microsoft to Buy LinkedIn for $26.2 Billion       

This table clearly illustrates that **Microsoft is the most popular article topic on LinkedIn**. After doing some post-analysis research, I found that this makes a lot of sense: during the time this dataset was collected, Microsoft was in the process of purchasing LinkedIn. And after reading the titles of the articles, **none of the article topics can be considered controversial**.
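The topic tallies read off the three tables above could also be computed directly. The following is a rough sketch under stated assumptions: the helper `top_topic_counts` and the `toy` DataFrame are hypothetical, and the values are made up so that the example runs without *News_Final.csv*. It uses `nlargest`, which for this purpose should behave like sorting in descending order and taking `.head(n)`.

```python
import pandas as pd

def top_topic_counts(df, platform_col, n=10):
    """Count how often each topic appears among the n most popular rows."""
    top = df.nlargest(n, platform_col)
    return top["Topic"].value_counts()

# Made-up stand-in data that reuses the dataset's column names
toy = pd.DataFrame({
    "Topic": ["obama", "economy", "microsoft", "obama", "microsoft", "palestine"],
    "Facebook": [900, 700, 100, 800, 50, 10],
    "LinkedIn": [5, 20, 300, 1, 250, 0],
})

# Tally the topics of the 3 most popular toy articles on each platform
for platform in ["Facebook", "LinkedIn"]:
    print(platform)
    print(top_topic_counts(toy, platform, n=3).to_string())
```

On the real data, calling the helper with `n=10` for `"Facebook"`, `"GooglePlus"` and `"LinkedIn"` would give the per-topic counts behind the descriptions above.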

### Question 5: Do social media platforms popularize articles with certain sentiments over others?
*Note: In order for the following code to function properly, you must first install the tabulate module from the terminal using* `pip install tabulate`

#### Q5) Facebook

Similar to Question 4, I first had to read in the raw dataset *News_Final.csv* using `pd.read_csv()` as a DataFrame, `df`.

In [24]:
df = pd.read_csv("News_Final.csv")

Next I sorted the DataFrame in descending order according to the Facebook popularity scores using `df.sort_values("Facebook", ascending=False)` and cut the DataFrame to the 10 most popular articles using `.head(10)`.

In [25]:
df = df.sort_values("Facebook", ascending=False).head(10)

After this I cut the DataFrame to only contain the columns `SentimentTitle`, `SentimentHeadline` and the `Facebook` popularity score using `df = df[["SentimentTitle", "SentimentHeadline", "Facebook"]]`. Finally, I printed the DataFrame in a manner that is easily readable using `print(df.to_markdown(index=False))`.

In [26]:
df = df[["SentimentTitle", "SentimentHeadline", "Facebook"]]

print(df.to_markdown(index=False)) 

|   SentimentTitle |   SentimentHeadline |   Facebook |
|-----------------:|--------------------:|-----------:|
|        0.0708683 |           0.18474   |      49211 |
|        0.118585  |          -0.1445    |      40836 |
|       -0.125     |           0.168878  |      32385 |
|        0         |          -0.0573539 |      30489 |
|       -0.113067  |          -0.104257  |      29564 |
|        0         |          -0.0573539 |      24594 |
|        0.265165  |           0.100051  |      22518 |
|       -0.144338  |          -0.0191366 |      22346 |
|       -0.0376889 |           0.200403  |      20371 |
|        0.149691  |          -0.206576  |      19771 |


I was surprised to see that the most popular article on Facebook had a positive sentiment for both its title and headline, since my results in Question 1 illustrated that there is a small negative correlation between the sentiment of an article title and its popularity. After analyzing the entire table, I can conclude that **there is no trend in which sentiments are most popular on Facebook**.

#### Q5) Google+

To answer this question for Google+, I repeated everything I did above for Facebook, except this time I sorted the 10 most popular values based on the popularity scores of Google+ using `df.sort_values("GooglePlus", ascending=False).head(10)`. Then I cut the DataFrame to only contain the `SentimentTitle`, `SentimentHeadline`, and the `GooglePlus` popularity score using `df = df[["SentimentTitle", "SentimentHeadline", "GooglePlus"]]`. Then I printed it out with the same line of code used previously.

In [41]:
df = pd.read_csv("News_Final.csv")

df = df.sort_values("GooglePlus", ascending=False).head(10)

df = df[["SentimentTitle", "SentimentHeadline", "GooglePlus"]]

print(df.to_markdown(index=False)) 

|   SentimentTitle |   SentimentHeadline |   GooglePlus |
|-----------------:|--------------------:|-------------:|
|       -0.0416667 |          0.0693172  |         1267 |
|       -0.166667  |         -0.0781828  |         1016 |
|        0         |         -0.375326   |         1001 |
|       -0.144338  |         -0.0191366  |          973 |
|       -0.286581  |         -0.0196419  |          804 |
|       -0.166667  |         -0.00326093 |          781 |
|       -0.113067  |         -0.104257   |          774 |
|       -0.0368932 |         -0.0340529  |          725 |
|        0.09375   |          0.0340965  |          666 |
|       -0.0895255 |         -0.0602381  |          577 |


This table is closer to what I expected to find during my qualitative analysis. 8 out of 10 of the articles have a negative sentiment for their title, and 8 out of 10 have a negative sentiment for their headline. This table also aligns with the results of Question 3: in most rows the title and headline sentiments share the same sign, consistent with the sentiment of an article title positively correlating with the sentiment of its headline. Overall, this table shows that **the most popular articles on Google+ tend to have a negative sentiment for both their title and headline**.
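The counts discussed above can be checked directly from the Google+ table. The following is a minimal sketch, assuming the ten rows are re-entered by hand (the variable name `google_top10` is only illustrative and is not part of the original notebook). It counts negative titles, negative headlines, and rows where both are negative.

```python
import pandas as pd

# Sentiment scores copied from the Google+ top-10 table above.
google_top10 = pd.DataFrame({
    "SentimentTitle": [-0.0416667, -0.166667, 0, -0.144338, -0.286581,
                       -0.166667, -0.113067, -0.0368932, 0.09375, -0.0895255],
    "SentimentHeadline": [0.0693172, -0.0781828, -0.375326, -0.0191366, -0.0196419,
                          -0.00326093, -0.104257, -0.0340529, 0.0340965, -0.0602381],
})

# Boolean masks: True where the sentiment score is strictly below zero.
title_is_negative = google_top10["SentimentTitle"] < 0
headline_is_negative = google_top10["SentimentHeadline"] < 0

negative_titles = int(title_is_negative.sum())
negative_headlines = int(headline_is_negative.sum())
both_negative = int((title_is_negative & headline_is_negative).sum())

print(negative_titles, negative_headlines, both_negative)
```

Note that a score of exactly 0 is treated as neutral here, so it counts as neither negative nor positive.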

#### Q5) LinkedIn

To answer this question for LinkedIn, I repeated everything I did above for Facebook, except this time I sorted the 10 most popular values based on the popularity scores of LinkedIn using `df.sort_values("LinkedIn", ascending=False).head(10)`. Then I cut the DataFrame to only contain the `SentimentTitle`, `SentimentHeadline`, and the `LinkedIn` popularity score using `df = df[["SentimentTitle", "SentimentHeadline", "LinkedIn"]]`. Then I printed it out with the same line of code used previously.

In [20]:
df = pd.read_csv("News_Final.csv")

df = df.sort_values("LinkedIn", ascending=False).head(10)

df = df[["SentimentTitle", "SentimentHeadline", "LinkedIn"]]

print(df.to_markdown(index=False)) 

|   SentimentTitle |   SentimentHeadline |   LinkedIn |
|-----------------:|--------------------:|-----------:|
|        0.0706653 |           0.07      |      20341 |
|        0.0954798 |           0.11875   |      19737 |
|        0.0592927 |          -0.1125    |      18004 |
|        0         |           0.245495  |      10465 |
|       -0.0109801 |           0.136386  |       9237 |
|        0.102062  |           0.0818317 |       8115 |
|        0.051031  |           0         |       6848 |
|        0.0944911 |          -0.028677  |       6682 |
|        0.0578854 |          -0.133436  |       6362 |
|       -0.0333333 |          -0.0279508 |       5222 |


This table shows that 7 out of 10 of the most popular articles on LinkedIn have a positive sentiment for their title and 5 out of 10 for their headline. This suggests that **the most popular articles on LinkedIn tend to have a positive sentiment in their title and headline**, although the results from Question 1 show that this is not a trend across the platform.
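Since the same sort, cut and select steps are repeated for each platform, they could be wrapped in a small helper. The sketch below is one possible way to do that; the function name `top_articles` and its parameters are hypothetical and do not appear in the original notebook. It is demonstrated on four rows copied from the LinkedIn table rather than on the full *News_Final.csv* file.

```python
import pandas as pd

def top_articles(df, platform, columns, n=10):
    """Return the n rows with the highest `platform` score,
    keeping only `columns` plus the platform score itself."""
    top = df.sort_values(platform, ascending=False).head(n)
    return top[list(columns) + [platform]]

# Four rows copied from the LinkedIn table above, deliberately out of order.
sample = pd.DataFrame({
    "SentimentTitle": [0.102062, 0.0706653, 0.0, 0.0954798],
    "SentimentHeadline": [0.0818317, 0.07, 0.245495, 0.11875],
    "LinkedIn": [8115, 20341, 10465, 19737],
})

top3 = top_articles(sample, "LinkedIn", ["SentimentTitle", "SentimentHeadline"], n=3)
print(top3.to_string(index=False))
```

With the full dataset loaded as `df`, the same call with `"Facebook"` or `"GooglePlus"` as the platform would reproduce the other tables, assuming those column names match the CSV as used in the cells above.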

### Question 6: What sources are most popular on social media platforms?
*Note: In order for the following code to function properly, you must install the tabulate module from the terminal using* `pip install tabulate`

#### Q6) Facebook

Similar to Questions 4 and 5, I first had to read in the raw dataset *News_Final.csv* using `pd.read_csv()` as a DataFrame, `df`.

In [None]:
df = pd.read_csv("News_Final.csv")

Next I sorted the DataFrame in descending order according to the Facebook popularity scores using `df.sort_values("Facebook", ascending=False)` and cut the DataFrame to the 10 most popular articles using `.head(10)`.

In [None]:
df = df.sort_values("Facebook", ascending=False).head(10)

After this I cut the DataFrame to only contain the `Source` column and the `Facebook` popularity score using `df = df[["Source", "Facebook"]]`. Finally, I printed the DataFrame in an easily readable format using `print(df.to_markdown(index=False))`.

In [21]:
df = df[["Source", "Facebook"]]

print(df.to_markdown(index=False)) 

| Source             |   Facebook |
|:-------------------|-----------:|
| New Zealand Herald |      49211 |
| Breitbart News     |      40836 |
| New York Times     |      32385 |
| CNN                |      30489 |
| New York Times     |      29564 |
| CNN                |      24594 |
| Breitbart News     |      22518 |
| Telegraph.co.uk    |      22346 |
| GameZone           |      20371 |
| GameZone           |      19771 |


Looking at this table initially led me to believe that **there is no trend in which sources are most popular on Facebook**, since only 4 of the sources appear more than once and none appears more than twice, but there is another perspective to consider. As a person who gets his news from Google News, I am familiar with New York Times, CNN, Telegraph and GameZone, and these sources account for 7 out of 10 of the most popular articles on Facebook. Although this is biased because it reflects only my own perspective on news sources, I find it interesting that the most popular sources on Facebook are notable in their own right.
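How often each source appears in the table can be tallied rather than counted by eye. The sketch below assumes the ten source names are re-entered by hand from the Facebook table (the variable names are illustrative) and uses `value_counts()` to list the sources that appear more than once.

```python
import pandas as pd

# Source names copied from the Facebook top-10 table above, in table order.
facebook_sources = pd.Series([
    "New Zealand Herald", "Breitbart News", "New York Times", "CNN",
    "New York Times", "CNN", "Breitbart News", "Telegraph.co.uk",
    "GameZone", "GameZone",
], name="Source")

# Number of top-10 articles per source.
counts = facebook_sources.value_counts()

# Keep only sources with more than one article in the top 10.
repeated = counts[counts > 1]
print(repeated)
```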

#### Q6) Google+

To answer this question for Google+, I repeated everything I did above for Facebook, except this time I sorted the 10 most popular values based on the popularity scores of Google+ using `df.sort_values("GooglePlus", ascending=False).head(10)`. Then I cut the DataFrame to only contain the `Source` and the `GooglePlus` popularity score using `df = df[["Source", "GooglePlus"]]`. Then I printed it out with the same line of code used previously.

In [44]:
df = pd.read_csv("News_Final.csv")

df = df.sort_values("GooglePlus", ascending=False).head(10)

df = df[["Source", "GooglePlus"]]

print(df.to_markdown(index=False)) 

| Source                               |   GooglePlus |
|:-------------------------------------|-------------:|
| CNNMoney                             |         1267 |
| The Verge                            |         1016 |
| The Verge                            |         1001 |
| Telegraph.co.uk                      |          973 |
| Narendra Modi (press release) (blog) |          804 |
| ZDNet                                |          781 |
| New York Times                       |          774 |
| Raw Story                            |          725 |
| Breitbart News                       |          666 |
| The Verge                            |          577 |


Again it is hard to analyze these news sources due to my own bias of what sources I know and don't know, but compared to Facebook this table seems to have a greater variety of sources. Due to the fact that the most popular articles on Google+ include a press release and three sources that I have never heard of (ZDNet, Raw Story and Breitbart News), I am inclined to conclude that **there is no trend in which sources are most popular on Google+**.
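The impression that Google+ has a greater variety of sources than Facebook can be made concrete by counting distinct sources in each top-10 table. This is only a sketch on the twenty source names copied from the two tables above, with illustrative variable names. Ten rows per platform is a very small sample, so the comparison is descriptive rather than conclusive.

```python
import pandas as pd

# Source names copied from the Facebook and Google+ top-10 tables above.
facebook_sources = pd.Series([
    "New Zealand Herald", "Breitbart News", "New York Times", "CNN",
    "New York Times", "CNN", "Breitbart News", "Telegraph.co.uk",
    "GameZone", "GameZone",
])
google_sources = pd.Series([
    "CNNMoney", "The Verge", "The Verge", "Telegraph.co.uk",
    "Narendra Modi (press release) (blog)", "ZDNet", "New York Times",
    "Raw Story", "Breitbart News", "The Verge",
])

# nunique() counts how many different source names occur in each top 10.
facebook_distinct = facebook_sources.nunique()
google_distinct = google_sources.nunique()
print(facebook_distinct, google_distinct)
```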

#### Q6) LinkedIn

To answer this question for LinkedIn, I repeated everything I did above for Facebook, except this time I sorted the 10 most popular values based on the popularity scores of LinkedIn using `df.sort_values("LinkedIn", ascending=False).head(10)`. Then I cut the DataFrame to only contain the `Source` and the `LinkedIn` popularity score using `df = df[["Source", "LinkedIn"]]`. Then I printed it out with the same line of code used previously.

In [43]:
df = pd.read_csv("News_Final.csv")

df = df.sort_values("LinkedIn", ascending=False).head(10)

df = df[["Source", "LinkedIn"]]

print(df.to_markdown(index=False)) 

| Source                  |   LinkedIn |
|:------------------------|-----------:|
| TechCrunch              |      20341 |
| Techcrunch              |      19737 |
| LinkedIn (blog)         |      18004 |
| TIME                    |      10465 |
| CNBC                    |       9237 |
| The Verge               |       8115 |
| Wall Street Journal     |       6848 |
| The Wall Street Journal |       6682 |
| Politico                |       6362 |
| Bloomberg               |       5222 |


The first thing I noticed about this table was that I recognized every source in it. Not only that, but all of the sources are related (though not exclusively) to business and professional news. Knowing these two things, I can confidently infer that **the most popular sources on LinkedIn are notable and tend to be related to the professionalism that LinkedIn as a platform is used for**.
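The LinkedIn table lists some outlets under more than one spelling (`TechCrunch` and `Techcrunch`, `Wall Street Journal` and `The Wall Street Journal`), so a per-source count would treat them as different sources. The sketch below shows one simple way to merge such variants before counting. The helper `normalize_source` is hypothetical, and its rules (lowercasing and dropping a leading "the") are an assumption that would need checking against the full list of sources in the dataset.

```python
import pandas as pd

# Source names copied from the LinkedIn top-10 table above.
linkedin_sources = pd.Series([
    "TechCrunch", "Techcrunch", "LinkedIn (blog)", "TIME", "CNBC",
    "The Verge", "Wall Street Journal", "The Wall Street Journal",
    "Politico", "Bloomberg",
], name="Source")

def normalize_source(name):
    """Lowercase the name and drop a leading 'the ' so spelling variants match."""
    cleaned = name.strip().lower()
    if cleaned.startswith("the "):
        cleaned = cleaned[len("the "):]
    return cleaned

# Count articles per normalized source name.
normalized = linkedin_sources.map(normalize_source)
counts = normalized.value_counts()
print(counts[counts > 1])
```

After normalization, the two TechCrunch rows and the two Wall Street Journal rows are each counted as a single source.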

### Discussion:

In order to discuss my findings, I am going to talk about each question separately while making some connections between findings along the way.

##### Question 1) Does the sentiment of an article title correlate to its popularity?

Although the correlation was small, my findings for Facebook and Google+ (and LinkedIn although it was statistically insignificant) aligned with my hypothesis that the sentiment of an article title negatively correlated with its popularity. This means that articles with more negative titles tend to be more popular on Facebook and Google+, even if it occurs unintentionally. This is something that should be taken into consideration by users when using these social media platforms for news information.

##### Question 2) Does the sentiment of an article headline correlate to its popularity?

The results from this analysis showed that all three social media platforms had a positive correlation between the sentiment of the article headline and its popularity. This is interesting given the negative correlations found in the previous question. This finding may mean that there isn't a causal relationship between sentiment and popularity, but this can only be confirmed with further research.

##### Question 3) Is there a correlation between the sentiments of article headlines and the article titles?

This was an important question to research because if they were negatively correlated (and the sentiment of an article title was negatively correlated to popularity), it could suggest that the publishers were manipulating the article titles to increase their popularity. Unfortunately (for the sake of this research), the sentiments of the article headlines and article titles were positively correlated, eliminating this theory.

##### Question 4) What article topics are most popular on social media platforms?

I predicted that the most controversial topics would be the most popular articles on social media, but only one of the social media platforms (Facebook) aligned with this theory. Facebook has been known for its [controversial news feed algorithm](https://www.washingtonpost.com/technology/2021/11/13/facebook-news-feed-algorithm-how-to-turn-it-off/), so this makes sense. On the other hand, Google+ and LinkedIn contradicted my hypothesis. Since LinkedIn promoted Microsoft content (Microsoft was purchasing LinkedIn at the time) and Facebook promoted controversial topics, I believe that rather than simply promoting controversial topics, social media platforms promote news articles that are beneficial to their cause.

##### Question 5) Do social media platforms popularize articles with certain sentiments over others?

Because the results from Questions 1 and 2 were mixed, I didn't know if this question would bring about anything worth noting (and unfortunately it did not). Because the most popular articles on Google+ were mainly negative in their titles and headlines, the most popular articles on LinkedIn were mainly positive in their titles and headlines, and the most popular articles on Facebook didn't show a trend with regards to sentiment, I cannot make any claims about the subject. Looking at a greater number of articles may help, but the very small correlations from Questions 1 and 2 imply that the results may not differ.

##### Question 6) What sources are most popular on social media platforms?

Although LinkedIn showed a trend in what types of article sources were the most popular, this was not my original intention when asking this question. I wanted to see if social media platforms promoted sources that were either controversial, not well-known, or extremely notable (Fox, CNN, WSJ, etc.). LinkedIn showed some promise of the most popular sources being notable, but its variation in sources and the subjectivity around this topic make it difficult for me to make any claims. Additionally, the fact that Facebook and Google+ didn't seem to show any trends with regards to the popularity of sources supports this reasoning.

### Implications

The correlations found during this study imply that there is a chance some of them reflect a causal relationship. Further research should examine whether or not sentiment has a causal effect on popularity.

Additionally, the findings in this research suggest that there isn't a "standard" for social media across platforms. Further research should dive deeper into specific platforms, especially since their popularity has only risen since the collection of this dataset.

### Limitations

The biggest limitation of this dataset is not having the number of users on each platform during the time of the dataset collection. This would have provided more context about the social media platforms. Additionally, Google+ is no longer a social media platform and never became popular compared to LinkedIn and Facebook, which limits the present-day relevance of its findings.

Another limitation of this dataset is that it doesn't prove causality. Obviously it is beneficial to research whether a correlation exists between two variables, but researching causality would highlight if social media platforms have underlying motivations with regards to their news.

### Conclusion:

The goal of this study was to uncover the hidden motivations of social media platforms as well as inform users about their tendencies in relation to the news they interact with on social media. Although a lot of the findings weren't related, **I can conclude that there is a correlation between sentiment and popularity, but it is inconsistent between article titles and headlines as well as across platforms**. Using a combination of quantitative and qualitative analyses helped to reveal some of the trends of news on social media, but further research is necessary. This study provides a solid framework to conduct further research about how people interact with news on social media and whether there is a causal relationship between sentiment and popularity.

### References:

Torgo, Luís & Moniz, Nuno. (2018). News Popularity in Multiple Social Media Platforms. UCI Machine Learning Repository.