# A4: Final Project Preliminary Proposal



### Motivation and Problem Statement:

In this analysis, I plan to examine which news outlets, topics and articles are most popular on social media platforms. In addition to seeing what types of news these platforms spread, I will examine how the sentiment scores of the titles and headlines of news articles correlate with their popularity, which can then be related back to the platforms. This analysis will help uncover some of the underlying motivations of social media platforms, as well as whether sentiment is related to the popularity of an article and, if it is, whether positive or negative sentiment has the stronger correlation with popularity. From a human-centered perspective, this will help inform people about the tendencies of the social media platforms they use and help them understand how their psychology gravitates toward certain types of news. I hope to understand how people interact with news on social media platforms on a daily basis in order to better understand how human psychology interacts with news on social media.

### Research Questions:

*Quantitative Questions*

**1) Does the sentiment of an article title correlate with its popularity?**

Hypothesis: I believe the more negative the sentiment of a title is, the more popular the article will be.

**2) Does the sentiment of an article headline correlate with its popularity?**

Hypothesis: I believe the more negative the sentiment of a headline is, the more popular the article will be.

**3) Is there a correlation between the sentiments of article headlines and the article titles?**

Hypothesis: I believe there will be a correlation between the sentiments of the headlines and titles. <br/><br/>

*Qualitative Questions*

**4) Do social media platforms popularize articles with certain topics over others?**

Hypothesis: I believe social media platforms will popularize articles with more polarizing topics over others.

**5) Do social media platforms popularize articles with certain sentiments over others?**

Hypothesis: I believe social media platforms will popularize articles with negative sentiments over others since it will lead to more engagement.

**6) Do social media platforms popularize articles with certain sources over others?**

Hypothesis: I believe social media platforms will popularize articles with more notable sources over others.


### Data Selected for Analysis:

The dataset [News Popularity in Multiple Social Media Platforms](https://archive.ics.uci.edu/ml/datasets/News+Popularity+in+Multiple+Social+Media+Platforms) represents collected data of about 100,000 different news items on four different topics: Economy, Microsoft, Obama and Palestine over the span of 8 months, between November 2015 and July 2016. To download the raw dataset via the hyperlink above, click "Data Folder", "Data/", "News_Final.csv". 

The dataset has a unique identifier for news items, the title of the news item according to the official media sources, the headline of the news item according to the official media sources, the original news outlet that published the news item, the query topic used to obtain the items in the official media sources, the date and time of the news items' publication, the final value of the news items' popularity according to the social media sources (Facebook, Google+, LinkedIn), the sentiment score of the title, and the sentiment score of the text in the news items' headline. The process to obtain these sentiment scores was carried out by applying the framework of the `qdap` [R package](https://www.rdocumentation.org/packages/qdap/versions/2.4.3) with default parametrization. Additional information about the dataset can be found [here](https://arxiv.org/pdf/1801.07055.pdf).

This dataset is suitable for addressing my research goal because it provides a vast array of different news articles in relation to three different social media platforms, as well as a sentiment analysis. I believe that the narrowed scope of four different news topics will allow me to understand the tendencies of these platforms rather than trying to categorize each article into topics as I move through the data. Additionally, I believe Facebook, Google+ and LinkedIn are different enough from each other to give an understanding of the types of articles that are popular in the different ecosystems (social, casual, professional) of social media platforms. Two things need to be noted when looking at this dataset: the popularity scores of the news items are given by the social media sources, and the sentiment scores are provided by a machine learning algorithm whose details are not provided in the notebook. This could lead to biased results that favor the social media companies and/or don't accurately reflect the sentiments of the news items.

In some cases, the popularity of a news item at a given moment (i.e. timeslice) could not be acquired. These cases are denoted with the value −1 and are mostly associated with items that were suggested in official media sources more than two days after their publication in the original news outlet. Such situations represent 12.4% of cases concerning the final popularity of items according to Facebook, and 6.2% of cases for both Google+ and LinkedIn.
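The share of unavailable values can be checked directly before any filtering. The following is a minimal sketch, assuming the platform columns are named `Facebook`, `GooglePlus` and `LinkedIn` as in `News_Final.csv`; the helper name `missing_share` and the toy rows are hypothetical and only illustrate the calculation.

```python
import pandas as pd

PLATFORMS = ["Facebook", "GooglePlus", "LinkedIn"]


def missing_share(df, columns=PLATFORMS, sentinel=-1):
    """Return, per column, the fraction of rows equal to the sentinel value."""
    return df[columns].eq(sentinel).mean()


# Toy rows for illustration only (not taken from the dataset)
toy = pd.DataFrame({
    "Facebook":   [-1, 10, 25, -1],
    "GooglePlus": [0, -1, 3, 1],
    "LinkedIn":   [2, 4, -1, 0],
})

shares = missing_share(toy)
print(shares)
```

Applied to the full dataframe, the same call should give values close to the percentages quoted above, which makes it a quick sanity check on the download.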

The license for this dataset is a Creative Commons Attribution 4.0 International license (CC BY 4.0), which allows for the sharing and adaptation of the datasets for any purpose, provided that the appropriate credit is given.


### Related Works:

Previous research has sought to uncover some of the factors related to news popularity on social media platforms. One study titled [The Pulse of News in Social Media: Forecasting Popularity](https://www.hpl.hp.com/research/scl/papers/newsprediction/pulse.pdf) found that the most important predictor of the popularity of an article was its source. Additionally, another study titled [Are You What You Tweet? The Impact of Sentiment on Digital News Consumption and Social Media Sharing](https://pubsonline.informs.org/doi/10.1287/isre.2022.1112) found that an increase in the sentiment of content increased how much the article was shared on social media, but decreased how many people actually opened the article.

Both of these studies helped inform the way I approached this study. Specifically, the first study guided me toward asking whether certain news sources were more popular than others and, if so, whether this was consistent across social media platforms. The second study made me wonder how much variation there was between the sentiment of the title and the headline of each article. It also relied specifically on Twitter for its data, so I wanted to see whether its findings were consistent across other social media platforms.

### Methodology:

In order to find the answers to my research questions, I will use different methods within Python depending on the variables I am using.

For Questions 1, 2, and 3, I will perform **quantitative analysis** by computing the **Spearman rank correlation** in Python. I chose this method because it is a non-parametric measure of the monotonic relationship between two variables based on their ranks, so it does not assume that the sentiment scores or the popularity scores are normally distributed. I will import the CSV dataset as a dataframe using `pandas` and import `spearmanr` from `scipy.stats` to analyze the data. I will present my results by showing the output and then describing the meaning of the coefficients and p-values produced.
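The quantitative step can be sketched as one helper that repeats the same correlation for every platform. This is only an illustration under assumptions: the function name `spearman_by_platform` and the toy rows are hypothetical, and the `-1` sentinel for unavailable popularity follows the dataset description.

```python
import pandas as pd
from scipy.stats import spearmanr


def spearman_by_platform(df, sentiment_col, platforms, sentinel=-1):
    """Spearman's rho and p-value of one sentiment column against each platform."""
    rows = []
    for platform in platforms:
        # Drop items whose popularity was not available on this platform
        valid = df.loc[df[platform] != sentinel]
        rho, p = spearmanr(valid[sentiment_col], valid[platform])
        rows.append({"platform": platform, "n": len(valid), "rho": rho, "p": p})
    return pd.DataFrame(rows)


# Toy rows for illustration only (not taken from the dataset)
toy = pd.DataFrame({
    "SentimentTitle": [-0.3, -0.1, 0.0, 0.2, 0.4],
    "Facebook": [1, 2, 3, 4, -1],
    "LinkedIn": [9, 7, 5, 3, 1],
})

result = spearman_by_platform(toy, "SentimentTitle", ["Facebook", "LinkedIn"])
print(result)
```

In the toy data, popularity rises strictly with sentiment on one platform and falls strictly on the other, so the coefficients sit at the two extremes of the scale. Real data would fall somewhere in between.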

For Questions 4, 5 and 6, I will perform **qualitative analysis** on the most popular articles for each social media platform. I chose qualitative analysis for these questions because it will allow me to order the data in descending order of popularity and then isolate the desired variable (topics, sentiments and sources). Using Python, I will look at the 20 most popular articles for each social media platform and determine whether any patterns emerge from the data. I plan on presenting the results of my qualitative analysis as an individual table for each social media platform and each variable being looked at (3 tables for each question). Below each table I will describe any patterns that emerge. For example, Question 4 will present a table containing the popularity and the topic for one social media platform, followed by a descriptive analysis. The second table for Question 4 will have the same variables for a different social media platform, followed by its own descriptive analysis, and so forth.
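One way to support the descriptive analysis is to count how often each value of a variable appears among the most popular items. The sketch below assumes the column names used in `News_Final.csv`; the helper name `top_n_counts` and the toy rows are hypothetical.

```python
import pandas as pd


def top_n_counts(df, platform, variable, n):
    """Count how often each value of `variable` appears among the n most popular items."""
    top = df.sort_values(platform, ascending=False).head(n)
    return top[variable].value_counts()


# Toy rows for illustration only (not taken from the dataset)
toy = pd.DataFrame({
    "Topic": ["obama", "economy", "microsoft", "obama", "palestine"],
    "Facebook": [500, 40, 300, 450, 5],
})

counts = top_n_counts(toy, "Facebook", "Topic", n=3)
print(counts)
```

The counts summarize a table at a glance, but they do not replace reading the individual rows, since a single very popular item can dominate a platform.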



### Findings:

```python
import pandas as pd
from scipy.stats import spearmanr

# Load the dataset and drop items whose Facebook popularity is unavailable (-1)
df = pd.read_csv("News_Final.csv")
print(df.shape)
df = df.loc[df["Facebook"] != -1]
print(df.shape)

# Calculate the Spearman rank correlation and corresponding p-value
rho, p = spearmanr(df['SentimentTitle'], df['Facebook'])

# Print the Spearman rank correlation and p-value
print(rho)
print(p)
```

```
(93239, 11)
(81637, 11)
-0.02294281802183881
5.5259904327745316e-11
```

With a large sample size, even a very weak correlation coefficient can have a statistically significant p-value. In this case, the weak correlation is unlikely to be due to chance; with a large sample, the low correlation is statistically 'real', that is, representative of the population.
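The effect of sample size on the p-value can be illustrated with a rough calculation. This is a sketch only: it uses the common large-sample approximation t = rho * sqrt((n - 2) / (1 - rho^2)) and treats t as standard normal, which is not necessarily the exact computation `spearmanr` performs. The helper name `spearman_p_approx` and the smaller sample size of 1,000 are hypothetical.

```python
import math


def spearman_p_approx(rho, n):
    """Approximate two-sided p-value for a rank correlation (large-n normal approximation)."""
    t = rho * math.sqrt((n - 2) / (1.0 - rho ** 2))
    # Two-sided tail probability of a standard normal at |t|
    return math.erfc(abs(t) / math.sqrt(2.0))


rho = -0.02294281802183881  # coefficient reported above for SentimentTitle vs Facebook

p_large = spearman_p_approx(rho, 81637)  # sample size reported above
p_small = spearman_p_approx(rho, 1000)   # hypothetical smaller sample

print(p_large, p_small)
```

Under this approximation, the same coefficient is far below the 0.05 threshold at the reported sample size but not significant with 1,000 items, which is why the size of the coefficient matters more than the p-value here.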


```python
from scipy.stats import spearmanr

# Load the dataset and drop items whose Google+ popularity is unavailable (-1)
df = pd.read_csv("News_Final.csv")
print(df.shape)
df = df.loc[df["GooglePlus"] != -1]
print(df.shape)

# Calculate the Spearman rank correlation and corresponding p-value
rho, p = spearmanr(df['SentimentTitle'], df['GooglePlus'])

# Print the Spearman rank correlation and p-value
print(rho)
print(p)
```

```
(93239, 11)
(87495, 11)
-0.016098803472051015
1.9149871400409848e-06
```


```python
from scipy.stats import spearmanr

# Load the dataset and drop items whose LinkedIn popularity is unavailable (-1)
df = pd.read_csv("News_Final.csv")
print(df.shape)
df = df.loc[df["LinkedIn"] != -1]
print(df.shape)

# Calculate the Spearman rank correlation and corresponding p-value
rho, p = spearmanr(df['SentimentTitle'], df['LinkedIn'])

# Print the Spearman rank correlation and p-value
print(rho)
print(p)
```

```
(93239, 11)
(87494, 11)
-0.005139980438811487
0.1284197911979036
```


```python
from scipy.stats import spearmanr

# Load the dataset and drop items whose Facebook popularity is unavailable (-1)
df = pd.read_csv("News_Final.csv")
print(df.shape)
df = df.loc[df["Facebook"] != -1]
print(df.shape)

# Calculate the Spearman rank correlation and corresponding p-value
rho, p = spearmanr(df['SentimentHeadline'], df['Facebook'])

# Print the Spearman rank correlation and p-value
print(rho)
print(p)
```

```
(93239, 11)
(81637, 11)
0.012887608496410779
0.0002310771416921479
```


```python
from scipy.stats import spearmanr

# Load the dataset and drop items whose Google+ popularity is unavailable (-1)
df = pd.read_csv("News_Final.csv")
print(df.shape)
df = df.loc[df["GooglePlus"] != -1]
print(df.shape)

# Calculate the Spearman rank correlation and corresponding p-value
rho, p = spearmanr(df['SentimentHeadline'], df['GooglePlus'])

# Print the Spearman rank correlation and p-value
print(rho)
print(p)
```

```
(93239, 11)
(87495, 11)
0.008825280344565827
0.00904120656682812
```


```python
from scipy.stats import spearmanr

# Load the dataset and drop items whose LinkedIn popularity is unavailable (-1)
df = pd.read_csv("News_Final.csv")
print(df.shape)
df = df.loc[df["LinkedIn"] != -1]
print(df.shape)

# Calculate the Spearman rank correlation and corresponding p-value
rho, p = spearmanr(df['SentimentHeadline'], df['LinkedIn'])

# Print the Spearman rank correlation and p-value
print(rho)
print(p)
```

```
(93239, 11)
(87494, 11)
0.014376066350593405
2.1138687562768872e-05
```


```python
from scipy.stats import spearmanr

# Load the full dataset (no popularity filter is needed for this comparison)
df = pd.read_csv("News_Final.csv")
print(df.shape)

# Calculate the Spearman rank correlation between title and headline sentiment
rho, p = spearmanr(df['SentimentTitle'], df['SentimentHeadline'])

# Print the Spearman rank correlation and p-value
print(rho)
print(p)
```

```
(93239, 11)
0.17748034416078898
0.0
```


```python
df = pd.read_csv("News_Final.csv")

# Keep the 20 items with the highest Facebook popularity and show their topics
df = df.sort_values("Facebook", ascending=False).head(20)
df = df[["Topic", "Facebook"]]

print(df.to_markdown(index=False))
```

| Topic     |   Facebook |
|:----------|-----------:|
| economy   |      49211 |
| obama     |      40836 |
| obama     |      32385 |
| obama     |      30489 |
| economy   |      29564 |
| obama     |      24594 |
| obama     |      22518 |
| microsoft |      22346 |
| microsoft |      20371 |
| microsoft |      19771 |
| obama     |      19136 |
| obama     |      17170 |
| economy   |      16993 |
| obama     |      16598 |
| obama     |      15692 |
| obama     |      15623 |
| obama     |      15606 |
| microsoft |      15250 |
| obama     |      14952 |
| microsoft |      14610 |


```python
df = pd.read_csv("News_Final.csv")

# Keep the 20 items with the highest Google+ popularity and show their topics
df = df.sort_values("GooglePlus", ascending=False).head(20)
df = df[["Topic", "GooglePlus"]]

print(df.to_markdown(index=False))
```

| Topic     |   GooglePlus |
|:----------|-------------:|
| economy   |         1267 |
| microsoft |         1016 |
| microsoft |         1001 |
| microsoft |          973 |
| economy   |          804 |
| microsoft |          781 |
| economy   |          774 |
| obama     |          725 |
| obama     |          666 |
| microsoft |          577 |
| microsoft |          555 |
| microsoft |          544 |
| microsoft |          504 |
| economy   |          468 |
| obama     |          463 |
| obama     |          451 |
| microsoft |          436 |
| microsoft |          435 |
| obama     |          432 |
| obama     |          425 |


```python
df = pd.read_csv("News_Final.csv")

# Keep the 20 items with the highest LinkedIn popularity and show their topics
df = df.sort_values("LinkedIn", ascending=False).head(20)
df = df[["Topic", "LinkedIn"]]

print(df.to_markdown(index=False))
```

| Topic     |   LinkedIn |
|:----------|-----------:|
| microsoft |      20341 |
| microsoft |      19737 |
| microsoft |      18004 |
| microsoft |      10465 |
| microsoft |       9237 |
| microsoft |       8115 |
| microsoft |       6848 |
| microsoft |       6682 |
| obama     |       6362 |
| microsoft |       5222 |
| economy   |       4328 |
| microsoft |       4259 |
| microsoft |       4059 |
| economy   |       3716 |
| microsoft |       3652 |
| economy   |       3433 |
| microsoft |       3128 |
| microsoft |       3087 |
| microsoft |       2963 |
| microsoft |       2653 |


```python
df = pd.read_csv("News_Final.csv")

# Keep the 20 items with the highest Facebook popularity and show their sentiment scores
df = df.sort_values("Facebook", ascending=False).head(20)
df = df[["SentimentTitle", "SentimentHeadline", "Facebook"]]

print(df.to_markdown(index=False))
```

|   SentimentTitle |   SentimentHeadline |   Facebook |
|-----------------:|--------------------:|-----------:|
|        0.0708683 |           0.18474   |      49211 |
|        0.118585  |          -0.1445    |      40836 |
|       -0.125     |           0.168878  |      32385 |
|        0         |          -0.0573539 |      30489 |
|       -0.113067  |          -0.104257  |      29564 |
|        0         |          -0.0573539 |      24594 |
|        0.265165  |           0.100051  |      22518 |
|       -0.144338  |          -0.0191366 |      22346 |
|       -0.0376889 |           0.200403  |      20371 |
|        0.149691  |          -0.206576  |      19771 |
|       -0.0416667 |          -0.104257  |      19136 |
|       -0.0131762 |          -0.0567511 |      17170 |
|       -0.0416667 |           0.0693172 |      16993 |
|       -0.0610515 |          -0.243403  |      16598 |
|       -0.102062  |          -0.436657  |      15692 |
|        0         |           0.0818317 |      

```python
df = pd.read_csv("News_Final.csv")

# Keep the 20 items with the highest Google+ popularity and show their sentiment scores
df = df.sort_values("GooglePlus", ascending=False).head(20)
df = df[["SentimentTitle", "SentimentHeadline", "GooglePlus"]]

print(df.to_markdown(index=False))
```

|   SentimentTitle |   SentimentHeadline |   GooglePlus |
|-----------------:|--------------------:|-------------:|
|       -0.0416667 |          0.0693172  |         1267 |
|       -0.166667  |         -0.0781828  |         1016 |
|        0         |         -0.375326   |         1001 |
|       -0.144338  |         -0.0191366  |          973 |
|       -0.286581  |         -0.0196419  |          804 |
|       -0.166667  |         -0.00326093 |          781 |
|       -0.113067  |         -0.104257   |          774 |
|       -0.0368932 |         -0.0340529  |          725 |
|        0.09375   |          0.0340965  |          666 |
|       -0.0895255 |         -0.0602381  |          577 |
|        0.0706653 |          0.07       |          555 |
|        0.0954798 |          0.11875    |          544 |
|       -0.0883883 |         -0.0369244  |          504 |
|        0.104494  |          0.0781929  |          468 |
|        0         |          0          |          463 |
|        0.044

```python
df = pd.read_csv("News_Final.csv")

# Keep the 20 items with the highest LinkedIn popularity and show their sentiment scores
df = df.sort_values("LinkedIn", ascending=False).head(20)
df = df[["SentimentTitle", "SentimentHeadline", "LinkedIn"]]

print(df.to_markdown(index=False))
```

|   SentimentTitle |   SentimentHeadline |   LinkedIn |
|-----------------:|--------------------:|-----------:|
|        0.0706653 |          0.07       |      20341 |
|        0.0954798 |          0.11875    |      19737 |
|        0.0592927 |         -0.1125     |      18004 |
|        0         |          0.245495   |      10465 |
|       -0.0109801 |          0.136386   |       9237 |
|        0.102062  |          0.0818317  |       8115 |
|        0.051031  |          0          |       6848 |
|        0.0944911 |         -0.028677   |       6682 |
|        0.0578854 |         -0.133436   |       6362 |
|       -0.0333333 |         -0.0279508  |       5222 |
|        0.125     |          0.10549    |       4328 |
|        0.051031  |          0.129099   |       4259 |
|        0.285052  |          0.346586   |       4059 |
|       -0.391312  |         -0.384448   |       3716 |
|        0.0833333 |          0.108821   |       3652 |
|        0.100504  |          0.0313594  |      

```python
df = pd.read_csv("News_Final.csv")

# Keep the 20 items with the highest Facebook popularity and show their sources
df = df.sort_values("Facebook", ascending=False).head(20)
df = df[["Source", "Facebook"]]

print(df.to_markdown(index=False))
```

| Source             |   Facebook |
|:-------------------|-----------:|
| New Zealand Herald |      49211 |
| Breitbart News     |      40836 |
| New York Times     |      32385 |
| CNN                |      30489 |
| New York Times     |      29564 |
| CNN                |      24594 |
| Breitbart News     |      22518 |
| Telegraph.co.uk    |      22346 |
| GameZone           |      20371 |
| GameZone           |      19771 |
| Liberty News Now   |      19136 |
| Global Grind       |      17170 |
| CNNMoney           |      16993 |
| Fox News           |      16598 |
| FrontPage Magazine |      15692 |
| New York Times     |      15623 |
| Washington Post    |      15606 |
| The Verge          |      15250 |
| WPEC               |      14952 |
| Investopedia       |      14610 |


```python
df = pd.read_csv("News_Final.csv")

# Keep the 20 items with the highest Google+ popularity and show their sources
df = df.sort_values("GooglePlus", ascending=False).head(20)
df = df[["Source", "GooglePlus"]]

print(df.to_markdown(index=False))
```

| Source                               |   GooglePlus |
|:-------------------------------------|-------------:|
| CNNMoney                             |         1267 |
| The Verge                            |         1016 |
| The Verge                            |         1001 |
| Telegraph.co.uk                      |          973 |
| Narendra Modi (press release) (blog) |          804 |
| ZDNet                                |          781 |
| New York Times                       |          774 |
| Raw Story                            |          725 |
| Breitbart News                       |          666 |
| The Verge                            |          577 |
| TechCrunch                           |          555 |
| Techcrunch                           |          544 |
| The Intercept                        |          504 |
| The Guardian                         |          468 |
| Washington Post                      |          463 |
| The Hill (blog)                      |        

```python
df = pd.read_csv("News_Final.csv")

# Keep the 20 items with the highest LinkedIn popularity and show their sources
df = df.sort_values("LinkedIn", ascending=False).head(20)
df = df[["Source", "LinkedIn"]]

print(df.to_markdown(index=False))
```

| Source                  |   LinkedIn |
|:------------------------|-----------:|
| TechCrunch              |      20341 |
| Techcrunch              |      19737 |
| LinkedIn (blog)         |      18004 |
| TIME                    |      10465 |
| CNBC                    |       9237 |
| The Verge               |       8115 |
| Wall Street Journal     |       6848 |
| The Wall Street Journal |       6682 |
| Politico                |       6362 |
| Bloomberg               |       5222 |
| Winnipeg Free Press     |       4328 |
| CNN Money               |       4259 |
| Forbes                  |       4059 |
| Winnipeg Free Press     |       3716 |
| New York Times          |       3652 |
| Winnipeg Free Press     |       3433 |
| Inc.com                 |       3128 |
| Inc.com                 |       3087 |
| The Verge               |       2963 |
| WIRED                   |       2653 |


### Discussion:

### Conclusion:

### References:

Torgo, Luís & Moniz, Nuno. (2018). News Popularity in Multiple Social Media Platforms. UCI Machine Learning Repository.