# Reddit-News Data Analyzer

This notebook performs data manipulation and analysis on Reddit-News data from 2008-2016. It is split into sections to keep it organized and to show clear steps for running the application:



1. Library Installations
2. Initial insights into the data
3. Top 10 Topics discussed
4. Count of good news and bad news

For Section 4, I used the TextBlob library to perform sentiment analysis on the news headlines. I ran it with both a PySpark DataFrame and a pandas DataFrame. The pandas DataFrame was easier to use and manipulate for sentiment analysis, but Spark provides parallel processing across the nodes of a cluster.





# Library Installations 

In [None]:
!pip install gensim



In [None]:
!pip install nltk



In [12]:
!pip install TextBlob



## Install PySpark library

In [7]:
!pip install pyspark
import os
import sys
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext, SparkSession
import pyspark.sql.functions as f

Collecting pyspark
  Downloading https://files.pythonhosted.org/packages/f0/26/198fc8c0b98580f617cb03cb298c6056587b8f0447e20fa40c5b634ced77/pyspark-3.0.1.tar.gz (204.2MB)
Collecting py4j==0.10.9
  Downloading https://files.pythonhosted.org/packages/9e/b6/6a4fb90cd235dc8e265a6a2067f2a2c99f0d91787f06aca4bcf7c23f3f80/py4j-0.10.9-py2.py3-none-any.whl (198kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.0.1-py2.py3-none-any.whl size=204612243 sha256=abdd401cdbc7e939d3e6504aae796f6168406181aaf8cf89859b09cfe12ec570
  Stored in directory: /root/.cache/pip/wheels/5e/bd/07/031766ca628adec8435bb40f0bd83bb676ce65ff4007f8e73f
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9 pyspark-3.0.1


In [None]:
import os
!wget https://downloads.apache.org/spark/spark-3.0.1/spark-3.0.1-bin-hadoop2.7.tgz
!tar -xvf /content/spark-3.0.1-bin-hadoop2.7.tgz
os.environ["SPARK_HOME"] = "/content/spark-3.0.1-bin-hadoop2.7"

In [10]:
sc = SparkContext.getOrCreate(SparkConf().setMaster("local[*]"))
spark = SparkSession \
    .builder \
    .getOrCreate()

# Get Reddit-News data from GitHub

In [8]:
!wget -q https://raw.githubusercontent.com/ashfarhangi/Massive_Storage_and_Big_Data/master/data/Reddit-News.csv

# Top 10 most discussed topics from 2008-2016


In [11]:
import gensim

# create list of stop words to use for filtering
stop_words = gensim.parsing.preprocessing.STOPWORDS.union(set(['new', 'news', 'says']))

# read in the Reddit-News file line by line and drop the leading Date column
reddit_news = sc.textFile('Reddit-News.csv').map(lambda line: line.split(',', 1)[-1])

# take each line in the RDD and split it by white space to get individual words
words = reddit_news.flatMap(lambda line: line.lower().split(' '))

# filter out stop words and very short tokens, then count the occurrences of each word
word_count = words.filter(lambda word: word not in stop_words and len(word) > 2).map(lambda word: (word, 1)).reduceByKey(lambda a,b: a+b)

# sort by count and take the top 10 entries
most_common_words = word_count.map(lambda pair: (pair[1], pair[0])).sortByKey(False).take(10)

# display the top 10 most discussed topics
print('The top 10 most discussed topics are:')
for count,pair in enumerate(most_common_words):
  print('#{} topic: "{}" with {} occurences'.format(count+1, pair[-1], pair[0]))

The top 10 most discussed topics are:
#1 topic: "police" with 2567 occurences
#2 topic: "government" with 2473 occurences
#3 topic: "people" with 2324 occurences
#4 topic: "world" with 1913 occurences
#5 topic: "u.s." with 1863 occurences
#6 topic: "china" with 1759 occurences
#7 topic: "israel" with 1722 occurences
#8 topic: "killed" with 1720 occurences
#9 topic: "president" with 1705 occurences
#10 topic: "war" with 1695 occurences
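
The RDD steps above (drop the date column, lowercase, split on spaces, filter, count, take the top entries) can be mirrored in plain Python to check the logic on a few lines without a Spark context. This is only a sketch: `STOP_WORDS` is a small hypothetical stand-in for the gensim stop-word set, `top_words` is an illustrative name, and the sample lines are made up.

```python
from collections import Counter

# Hypothetical stand-in for gensim's STOPWORDS plus the notebook's extra words.
STOP_WORDS = {"the", "and", "for", "new", "news", "says"}

def top_words(lines, n=10):
    """Count words across CSV lines of the form 'Date,Headline'."""
    counts = Counter()
    for line in lines:
        # Keep everything after the first comma, i.e. drop the Date column.
        headline = line.split(",", 1)[-1]
        for word in headline.lower().split(" "):
            # Same filter as the notebook: no stop words, length above 2.
            if word not in STOP_WORDS and len(word) > 2:
                counts[word] += 1
    return counts.most_common(n)

# Made-up sample; the header row is filtered because "news" is a stop word.
sample = [
    "Date,News",
    "2016-07-01,Police says new law helps police",
    "2016-07-01,Government and police meet government officials",
]
print(top_words(sample, n=2))  # [('police', 3), ('government', 2)]
```

Because the split is on single spaces only, punctuation stays attached to tokens, which is why an entry such as "u.s." appears in the results above.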


# Sentiment Analysis using TextBlob library

The following sentiment analysis of the Reddit-News data uses the TextBlob library twice for comparison: first with a pandas DataFrame, then with a PySpark DataFrame.
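
One optional preprocessing step, not part of the original notebook: the older rows in the pandas output of this notebook show headlines stored as the text of a Python bytes literal (they start with `b'` or `b"`). The sketch below unwraps such strings before scoring. The helper name `strip_bytes_repr` is hypothetical and the wrapped example headline is made up; whether unwrapping changes the polarity scores noticeably has not been checked here.

```python
import ast

def strip_bytes_repr(headline: str) -> str:
    """Return the text inside a b'...' or b"..." wrapper, else the input unchanged."""
    text = headline.strip()
    looks_wrapped = (
        len(text) >= 3
        and text[0] == "b"
        and text[1] in "'\""
        and text[-1] == text[1]
    )
    if looks_wrapped:
        try:
            # literal_eval only parses literals, so it is safe on untrusted text.
            value = ast.literal_eval(text)
        except (ValueError, SyntaxError):
            return headline
        if isinstance(value, bytes):
            return value.decode("utf-8", errors="replace")
    return headline

print(strip_bytes_repr("b'Example headline about Turkey'"))
print(strip_bytes_repr("IMF chief backs Athens as permanent Olympic host"))
```

With pandas this could be applied as `reddit_news_df['News'].apply(strip_bytes_repr)` before the sentiment column is computed.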

## Using Pandas Dataframe with TextBlob library for sentiment analysis

In [13]:
import pandas as pd
from textblob import TextBlob

# read in the reddit-news as a pandas dataframe 
reddit_news_df = pd.read_csv('Reddit-News.csv', parse_dates=True, index_col='Date')

# apply the TextBlob sentiment analysis to each row containing news headline
reddit_news_df['Sentiment Score'] = reddit_news_df['News'].apply(lambda headline: TextBlob(headline).sentiment.polarity) 
display(reddit_news_df)

# count headlines with a positive (good) or negative (bad) sentiment score
sentiment_scores = reddit_news_df['Sentiment Score']
good_news_count = reddit_news_df[reddit_news_df['Sentiment Score'] > 0].count()
bad_news_count = reddit_news_df[reddit_news_df['Sentiment Score'] < 0].count()
print('Good news count: \n{}\n'.format(good_news_count))
print('Bad news count: \n{}\n'.format(bad_news_count))



| Date | News | Sentiment Score |
|---|---|---|
| 2016-07-01 | A 117-year-old woman in Mexico City finally re... | -0.066667 |
| 2016-07-01 | IMF chief backs Athens as permanent Olympic host | 0.000000 |
| 2016-07-01 | The president of France says if Brexit won, so... | 0.000000 |
| 2016-07-01 | British Man Who Must Give Police 24 Hours' Not... | 0.111111 |
| 2016-07-01 | 100+ Nobel laureates urge Greenpeace to stop o... | 0.000000 |
| ... | ... | ... |
| 2008-06-08 | b'Man goes berzerk in Akihabara and stabs ever... | -0.200000 |
| 2008-06-08 | b'Threat of world AIDS pandemic among heterose... | 0.000000 |
| 2008-06-08 | b'Angst in Ankara: Turkey Steers into a Danger... | -0.600000 |
| 2008-06-08 | b"UK: Identity cards 'could be used to spy on ... | 0.059091 |


Good news count: 
News               21464
Sentiment Score    21464
dtype: int64

Bad news count: 
News               17696
Sentiment Score    17696
dtype: int64
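
The two counts above use strict comparisons (`> 0` and `< 0`), so headlines with a polarity of exactly 0 fall into neither group. A minimal sketch of the same rule with an explicit third bucket follows; the label names and the `label_polarity` helper are assumptions for illustration, and the scores are the nine values visible in the displayed rows.

```python
from collections import Counter

def label_polarity(score: float) -> str:
    """Map a TextBlob polarity value to a coarse label."""
    if score > 0:
        return "good"
    if score < 0:
        return "bad"
    return "neutral"  # polarity exactly 0: counted as neither good nor bad

# Polarity values copied from the rows displayed above.
scores = [-0.066667, 0.0, 0.0, 0.111111, 0.0, -0.2, 0.0, -0.6, 0.059091]
counts = Counter(label_polarity(s) for s in scores)
print(counts["good"], counts["bad"], counts["neutral"])  # 2 3 4
```

On the full DataFrame, something like `reddit_news_df['Sentiment Score'].apply(label_polarity).value_counts()` would give all three totals in one Series.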



## Using PySpark dataframes with TextBlob library for sentiment analysis 

In [None]:
from textblob import TextBlob

# function used to return the sentiment of the passed in news headline
def find_sentiment(news_headline):
  sentiment_score = TextBlob(news_headline).sentiment.polarity
  return sentiment_score
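
A note on robustness: Spark passes `None` to a Python UDF for null column values, and `TextBlob` raises a `TypeError` when given anything other than a string. The sketch below wraps a scoring function so that missing or blank headlines produce `None` instead of an error. `null_safe` and `toy_scorer` are hypothetical names; the toy scorer exists only so the example runs without TextBlob, and whether the dataset actually contains null headlines is not established here.

```python
from typing import Callable, Optional

def null_safe(
    scorer: Callable[[str], float],
) -> Callable[[Optional[str]], Optional[float]]:
    """Wrap a scorer so that None or blank input returns None."""
    def wrapped(headline: Optional[str]) -> Optional[float]:
        if headline is None or not headline.strip():
            return None
        return float(scorer(headline))
    return wrapped

# Toy scorer used only to exercise the wrapper in isolation.
def toy_scorer(headline: str) -> float:
    return 1.0 if "good" in headline.lower() else 0.0

safe_score = null_safe(toy_scorer)
print(safe_score(None))               # None
print(safe_score("Good news today"))  # 1.0
```

In the notebook, the wrapped function could be registered in place of the plain one, for example `f.udf(null_safe(find_sentiment), DoubleType())`.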

In [14]:
from pyspark.sql.types import DoubleType

# read in reddit-news and display schema info
reddit_news_df = spark.read.csv('Reddit-News.csv', inferSchema=True, header=True)
reddit_news_df.printSchema()
news_data = reddit_news_df.select('News')

# create a user-defined function that will apply find_sentiment to passed in headlines
sentiment_udf = f.udf(find_sentiment, DoubleType())
spark.udf.register('sentiment', sentiment_udf)

# get sentiment scores and create new column with sentiment scores for each headline
news_data_with_sentiment = reddit_news_df.withColumn('Sentiment Score', sentiment_udf('News').cast('double'))
news_data_with_sentiment.show()

root
 |-- Date: string (nullable = true)
 |-- News: string (nullable = true)

+----------+--------------------+--------------------+
|      Date|                News|     Sentiment Score|
+----------+--------------------+--------------------+
|2016-07-01|A 117-year-old wo...|-0.06666666666666667|
|2016-07-01|IMF chief backs A...|                 0.0|
|2016-07-01|The president of ...|                 0.0|
|2016-07-01|British Man Who M...| 0.11111111111111112|
|2016-07-01|100+ Nobel laurea...|                 0.0|
|2016-07-01|Brazil: Huge spik...|  0.4000000000000001|
|2016-07-01|Austria's highest...|                -0.2|
|2016-07-01|Facebook wins pri...|                0.25|
|2016-07-01|Switzerland denie...|                 0.0|
|2016-07-01|China kills milli...|                 0.5|
|2016-07-01|France Cracks Dow...| -0.1277777777777778|
|2016-07-01|Abbas PLO Faction...|                 0.0|
|2016-07-01|Taiwanese warship...|                 0.0|
|2016-07-01|Iran celebrates A...|         