# Initial Exploratory Data Analysis
## Initial Setup

In the following code we check our configuration and set up useful functions.

In [1]:
spark

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
10,application_1575479622115_0011,pyspark,idle,Link,Link,✔


SparkSession available as 'spark'.


<pyspark.sql.session.SparkSession object at 0x7f0029998b38>

In [2]:
%%info

ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
10,application_1575479622115_0011,pyspark,idle,Link,Link,✔


In [3]:
sc.install_pypi_package("pandas")

Collecting pandas
  Using cached https://files.pythonhosted.org/packages/52/3f/f6a428599e0d4497e1595030965b5ba455fd8ade6e977e3c819973c4b41d/pandas-0.25.3-cp36-cp36m-manylinux1_x86_64.whl
Collecting python-dateutil>=2.6.1
  Using cached https://files.pythonhosted.org/packages/d4/70/d60450c3dd48ef87586924207ae8907090de0b306af2bce5d134d78615cb/python_dateutil-2.8.1-py2.py3-none-any.whl
Installing collected packages: python-dateutil, pandas
Successfully installed pandas-0.25.3 python-dateutil-2.8.1

In [4]:
sc.install_pypi_package("matplotlib")

Collecting matplotlib
  Using cached https://files.pythonhosted.org/packages/4e/11/06958a2b895a3853206dea1fb2a5b11bf044f626f90745987612af9c8f2c/matplotlib-3.1.2-cp36-cp36m-manylinux1_x86_64.whl
Collecting kiwisolver>=1.0.1
  Using cached https://files.pythonhosted.org/packages/f8/a1/5742b56282449b1c0968197f63eae486eca2c35dcd334bab75ad524e0de1/kiwisolver-1.1.0-cp36-cp36m-manylinux1_x86_64.whl
Collecting pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1
  Using cached https://files.pythonhosted.org/packages/c0/0c/fc2e007d9a992d997f04a80125b0f183da7fb554f1de701bbb70a8e7d479/pyparsing-2.4.5-py2.py3-none-any.whl
Collecting cycler>=0.10
  Using cached https://files.pythonhosted.org/packages/f7/d2/e07d3ebb2bd7af696440ce7e754c59dd546ffe1bbe732c8ab68b9c834e61/cycler-0.10.0-py2.py3-none-any.whl
Installing collected packages: kiwisolver, pyparsing, cycler, matplotlib
Successfully installed cycler-0.10.0 kiwisolver-1.1.0 matplotlib-3.1.2 pyparsing-2.4.5

In [5]:
import pandas as pd
import matplotlib.pyplot as plt
import re
from pyspark.sql.functions import col
from pyspark.sql.functions import sum as spark_sum

In [6]:
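def run_sql(statement):
    """Run a Spark SQL statement; return a DataFrame, or None on failure."""
    try:
        return sqlContext.sql(statement)
    except Exception as e:
        # Only PySpark's captured exceptions carry desc/stackTrace,
        # so fall back to the plain exception for anything else.
        print(getattr(e, 'desc', e), '\n', getattr(e, 'stackTrace', ''))
        return None

def run_sql_pandas(statement):
    """Run a Spark SQL statement and collect the result as a pandas DataFrame."""
    result = run_sql(statement)
    # run_sql returns None when the query fails, so guard before converting.
    return result.toPandas() if result is not None else None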


## Read the data and inspect its structure

In [7]:
raw_df = spark.read.json("s3://topic-sentiment-1/combined/2019-11-26_googlebot_lt.json").cache()

In [8]:
raw_df.printSchema()

root
 |-- authors: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- date_download: string (nullable = true)
 |-- date_modify: string (nullable = true)
 |-- date_publish: string (nullable = true)
 |-- description: string (nullable = true)
 |-- filename: string (nullable = true)
 |-- image_url: string (nullable = true)
 |-- language: string (nullable = true)
 |-- localpath: string (nullable = true)
 |-- source_domain: string (nullable = true)
 |-- text: string (nullable = true)
 |-- title: string (nullable = true)
 |-- title_page: string (nullable = true)
 |-- title_rss: string (nullable = true)
 |-- url: string (nullable = true)

Here is an explanation of these columns.

* authors is an array of author names.  It is unclear if we will use this in the analysis.
* date_download and date_modify are both artifacts of when the article was crawled.  We won't use these in the analysis, but they may be useful for debugging.
* date_publish is the actual date shown on the article.  Because we will be looking at trends over time, this column is important.
* description is a short description of the article.  We will use this as a substitute for text when text is missing.
* filename is the file where the crawl was saved. We won't use this in analysis.
* image_url is a main image for an article, although this is sometimes a placeholder image. We won't be using this in the analysis.
* language is a two-letter code for the language of the article.  For simplicity, we will only use articles in English (language = 'en').  There is more analysis on this later.
* localpath is the full path for the file that was stored during crawling.  We won't be using this.
* source_domain is the publication of the article.  This may be used in analysis.
* text is the main text of the article.
* title is the title for the article.  We will drop articles without titles.
* title_page is the title you would see in the browser tab or header. We won't use this in the analysis.
* title_rss is the title in RSS for articles collected in that manner.  We won't use this in the analysis.
* url is the URL of the article.  This probably won't be used in the analysis, but it will be useful to keep around.

So in the clean data we want to drop: filename, image_url, localpath, title_page, and title_rss.

In [9]:
columns_to_drop = ['filename', 'image_url', 'localpath', 'title_page', 'title_rss']
raw_df = raw_df.drop(*columns_to_drop)

### Data types

The date_publish column is a string.  We need a working date column, but we probably want to keep the original column so we don't have to worry about precision problems when comparing time stamps.

In [10]:
raw_df = raw_df.withColumn("published", (col("date_publish").cast("timestamp")))
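
With Spark's default (non-ANSI) casting, a string that cannot be parsed as a timestamp becomes null rather than raising an error, so the new column can hold more nulls than the original. The plain-Python sketch below mimics that behaviour outside Spark. The `parse_published` helper and the `YYYY-MM-DD HH:MM:SS` format are assumptions for illustration; the real date_publish strings may look different.

```python
from datetime import datetime

# Assumed format for illustration only; the crawl output may use another one.
ASSUMED_FORMAT = "%Y-%m-%d %H:%M:%S"

def parse_published(value, fmt=ASSUMED_FORMAT):
    """Return a datetime, or None if the value is missing or does not parse."""
    if value is None:
        return None
    try:
        return datetime.strptime(value, fmt)
    except ValueError:
        return None

# Made-up sample values: one valid, one unparseable, one missing.
samples = ["2019-11-25 08:30:00", "yesterday", None]
parsed = [parse_published(s) for s in samples]

# Strings that were present but failed to parse.
unparsed_strings = sum(
    1 for s, p in zip(samples, parsed) if s is not None and p is None
)
```

Counting the strings that were present but failed to parse is a quick way to tell parsing failures apart from dates that were missing to begin with.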

## Look for missing data

We will use a method that was shown in one of the mini-projects to calculate the total number of blank values for each column.

In [11]:
def count_null(col_name):
    return spark_sum(col(col_name).isNull().cast('integer')).alias(col_name)

# Build up a list of column expressions, one per column.
exprs = [count_null(col_name) for col_name in raw_df.columns]

# The aggregation itself runs in the next cell. There, *exprs unpacks the
# list of expressions into separate positional arguments.

In [12]:
raw_df.agg(*exprs).toPandas().T

                    0
authors             0
date_download       0
date_modify         0
date_publish    43223
description    135342
language       207855
source_domain       0
text            71495
title             322
url                 0
published       43235

BTW, the T is for transpose: it flips the one-row result so that each column name becomes a row, which is easier to read.

Note that further analysis showed that title_rss often has the string value 'NULL'.  We would need to do some work to convert these to NaN if we cared about this column.  This does not appear to be a problem for any other columns.

Also note that published has 12 more nulls than date_publish (43235 vs. 43223), so 12 non-null date strings failed to convert to a timestamp.

Because only a small number of articles are missing title, we will exclude those articles.

We need to do more analysis to determine what other articles must be dropped due to missing data.
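
If we did care about a column that uses the string 'NULL' as a placeholder, the conversion could look roughly like the sketch below. It is plain Python rather than Spark so it runs without a session; `normalize_missing`, `count_missing`, and the sample rows are made up for illustration.

```python
def normalize_missing(value, placeholders=("NULL",)):
    """Map placeholder strings such as 'NULL' to None; leave other values alone."""
    if isinstance(value, str) and value.strip() in placeholders:
        return None
    return value

def count_missing(rows, columns):
    """Count missing values per column after placeholder normalization."""
    counts = {name: 0 for name in columns}
    for row in rows:
        for name in columns:
            if normalize_missing(row.get(name)) is None:
                counts[name] += 1
    return counts

# Made-up rows for illustration only.
rows = [
    {"title": "Example headline", "title_rss": "NULL"},
    {"title": None, "title_rss": "Example headline"},
    {"title": "Another headline", "title_rss": None},
]
missing = count_missing(rows, ["title", "title_rss"])
```

Normalizing before counting matters because a plain null check would treat the 'NULL' string as a real value and under-report the missing entries.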

## Set up a database we can query

In [13]:
run_sql('drop database if exists topic cascade')
run_sql('create database topic')

DataFrame[]

In [14]:
permanent_table_name = 'topic.articles'
raw_df.write.format("parquet").saveAsTable(permanent_table_name)

### Missing text

There are about 71K articles without text and 135K articles without descriptions.  What if we were to combine the two into one field?  Would that help increase the number of articles we can analyze?

In [15]:
run_sql_pandas('SELECT count(url) FROM topic.articles WHERE text IS NULL AND description IS NULL')

   count(url)
0        5881

Yes, only about 6K articles are missing both text and description.  In our data clean up, we will use text, unless it is missing, in which case we will substitute description if it is present.
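
The substitution rule can be sketched per article as below. This is a plain-Python illustration, and `body_text` is a hypothetical helper name; in Spark, `pyspark.sql.functions.coalesce(col("text"), col("description"))` should give the same per-row result.

```python
def body_text(article):
    """Prefer text; fall back to description; None when both are missing."""
    for field in ("text", "description"):
        value = article.get(field)
        if value is not None:
            return value
    return None

# Made-up articles for illustration only.
with_text = {"text": "Full article body.", "description": "Short summary."}
description_only = {"text": None, "description": "Short summary."}
neither = {"text": None, "description": None}

results = [body_text(a) for a in (with_text, description_only, neither)]
```

Articles for which this returns None are the ones counted by the query above, and they would have to be dropped.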

### Language analysis

First, let's see what languages are in use.

In [16]:
run_sql_pandas('SELECT language, count(source_domain) FROM topic.articles GROUP BY language')

  language  count(source_domain)
0       en                695169
1       vi                 12878
2     None                207855
3       tr                  1552
4       es                     2
5       ar                     1
6       fa                 14173
7       zh                  7261
8       ja                    17

In [17]:
source_language_pd = run_sql_pandas('''
    SELECT source_domain, language, count(*) FROM topic.articles
    GROUP BY source_domain, language
''')
source_language_pd.sort_values(by='source_domain').head(300)

             source_domain language  count(1)
57              apnews.com     None     73877
28               axios.com       en     46207
16                 bbc.com       en      5721
27                 bbc.com     None     23169
50                 bbc.com       fa     14173
8                  bbc.com       zh      7261
24                 bbc.com       tr      1552
35                 bbc.com       vi     12878
13           bloomberg.com       en         4
20              boston.com     None       189
44              boston.com       en     38049
33         bostonglobe.com       en        91
7            breitbart.com       en     17988
42             cbsnews.com       en     21258
4     chicago.suntimes.com       en        44
52      chicagotribune.com       en     42206
55               chron.com     None     33764
25               chron.com       en        15
49                cnbc.com       en       555
58                cnbc.com     None     14483
29            dailykos.com     Non

For simplicity, we will use only articles in English.   But what about all those articles without a language?   Let's do some analysis to see if we can set the language value.

In [18]:
run_sql('''
        SELECT source_domain, count(*) FROM topic.articles 
        WHERE language IS NULL 
        GROUP BY source_domain
        ''').toPandas()

      source_domain  count(1)
0            ft.com        20
1         msnbc.com       887
2   theatlantic.com         9
3       nbcnews.com         1
4         slate.com      4956
5        boston.com       189
6           bbc.com     23169
7      dailykos.com     21609
8       mediate.com         1
9          time.com       125
10          npr.org         2
11      thehill.com     34763
12        chron.com     33764
13       apnews.com     73877
14         cnbc.com     14483

In [19]:
all_articles = run_sql_pandas('''
    SELECT source_domain, count(*) as total_articles FROM topic.articles
    GROUP BY source_domain
''')
all_articles.set_index('source_domain', inplace=True)
all_articles.sort_index().head(100)

                        total_articles
source_domain                         
apnews.com                       73877
axios.com                        46207
bbc.com                          64754
bloomberg.com                        4
boston.com                       38238
bostonglobe.com                     91
breitbart.com                    17988
cbsnews.com                      21258
chicago.suntimes.com                44
chicagotribune.com               42206
chron.com                        33779
cnbc.com                         15038
dailykos.com                     21609
denverpost.com                   41371
economist.com                    23403
fivethirtyeight.com              13677
forbes.com                         917
foxnews.com                      47253
ft.com                           19717
latimes.com                      20431
mediate.com                         35
msnbc.com                          887
nationalreview.com                  83
nbcnews.com              

In [20]:
in_english = run_sql_pandas('''
    SELECT source_domain, count(*) as in_english FROM topic.articles
    WHERE language = 'en'
    GROUP BY source_domain
''')
in_english.set_index('source_domain', inplace=True)
in_english.sort_index().head(100)

                        in_english
source_domain                     
axios.com                    46207
bbc.com                       5721
bloomberg.com                    4
boston.com                   38049
bostonglobe.com                 91
breitbart.com                17988
cbsnews.com                  21258
chicago.suntimes.com            44
chicagotribune.com           42206
chron.com                       15
cnbc.com                       555
denverpost.com               41371
economist.com                23403
fivethirtyeight.com          13677
forbes.com                     917
foxnews.com                  47253
ft.com                       19697
latimes.com                  20431
mediate.com                     34
nationalreview.com              83
nbcnews.com                  55437
npr.org                      19922
nypost.com                     361
nytimes.com                   3925
reason.com                    1942
reuters.com                     92
rt.com              

In [21]:
no_language = run_sql_pandas('''
    SELECT source_domain, count(*) as no_language
    FROM topic.articles
    WHERE language IS NULL
    GROUP BY source_domain
''')
no_language.set_index('source_domain', inplace=True)
no_language.sort_index().head(100)

                 no_language
source_domain               
apnews.com             73877
bbc.com                23169
boston.com               189
chron.com              33764
cnbc.com               14483
dailykos.com           21609
ft.com                    20
mediate.com                1
msnbc.com                887
nbcnews.com                1
npr.org                    2
slate.com               4956
theatlantic.com            9
thehill.com            34763
time.com                 125

In [22]:
summary = all_articles.merge(in_english, how='outer', on='source_domain').merge(no_language, how='outer', on='source_domain')
summary.sort_index().head(200)

                        total_articles  in_english  no_language
source_domain                                                  
apnews.com                       73877         NaN      73877.0
axios.com                        46207     46207.0          NaN
bbc.com                          64754      5721.0      23169.0
bloomberg.com                        4         4.0          NaN
boston.com                       38238     38049.0        189.0
bostonglobe.com                     91        91.0          NaN
breitbart.com                    17988     17988.0          NaN
cbsnews.com                      21258     21258.0          NaN
chicago.suntimes.com                44        44.0          NaN
chicagotribune.com               42206     42206.0          NaN
chron.com                        33779        15.0      33764.0
cnbc.com                         15038       555.0      14483.0
dailykos.com                     21609         NaN      21609.0
denverpost.com                   41371  

If language is NaN for all articles and we know the source to be in English, we can assume the language is English.

If only a tiny percentage of articles are missing language for a given publication, we can just ignore those as not worth our time.

But if a significant number of articles (but not all articles) have no language set, we need to examine some of the articles to see if we can make assumptions.    So next, we pull out URLs for some of the articles with no language set for a given publication.

In [23]:
pd.set_option('display.max_colwidth', -1)
run_sql_pandas('''
    SELECT url FROM topic.articles
    WHERE source_domain = 'bbc.com' AND language IS NULL
''')

                                                                              url
0      https://www.bbc.com/urdu/sport/2009/11/091102_india_vs_aussi_4_zs         
1      https://www.bbc.com/urdu/india/2011/07/110711_train_accident_update       
2      https://www.bbc.com/urdu/pakistan/2010/12/101215_sasoli_demos_rza         
3      https://www.bbc.com/urdu/pakistan/2011/01/110112_pics_biden_pakistan_zs   
4      https://www.bbc.com/urdu/pakistan/2011/10/111011_pak_prisoners_pics_nj    
...                                                                       ...    
23164  https://www.bbc.com/urdu/pakistan/2010/10/101031_baat_say_bbat_nj         
23165  https://www.bbc.com/urdu/world/2011/02/110218_us_decline_justine_ra       
23166  https://www.bbc.com/urdu/pakistan/2011/02/110210_mardan_attack_analysis_zz
23167  https://www.bbc.com/urdu/opinion/2010/01/100109_gillani_pm_fm             
23168  https://www.bbc.com/urdu/world/2011/07/110715_us_recognise_rebels_zs      

[23169 rows x 1

I repeated this analysis for several publications.  Of these, bbc.com is the only one where we cannot assume that articles with no language set are in English; the URLs shown above are all under /urdu/.
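
The triage described in this section could be sketched as follows. The counts are copied from the summary table, but the 1% cutoff for a "tiny" share, the helper names, and the set of domains treated as English are all assumptions made for illustration, not decisions taken in this notebook.

```python
def missing_language_bucket(total, no_language, tiny_share=0.01):
    """Classify a publication by the share of its articles lacking a language."""
    if no_language == 0:
        return "none missing"
    if no_language == total:
        return "all missing"
    if no_language / total <= tiny_share:
        return "tiny share missing"
    return "needs inspection"

# (total_articles, no_language) from the summary table; NaN is treated as 0.
summary = {
    "apnews.com": (73877, 73877),
    "bbc.com": (64754, 23169),
    "boston.com": (38238, 189),
    "axios.com": (46207, 0),
}
buckets = {
    domain: missing_language_bucket(total, missing)
    for domain, (total, missing) in summary.items()
}

def fill_language(source_domain, language, assume_english):
    """Fill a missing language with 'en' only for domains judged to be English."""
    if language is None and source_domain in assume_english:
        return "en"
    return language

# Placeholder set for illustration; bbc.com is deliberately left out.
assume_english = {"apnews.com"}
filled = fill_language("apnews.com", None, assume_english)
untouched = fill_language("bbc.com", None, assume_english)
```

Keeping the list of English publications as an explicit argument means a domain like bbc.com stays unfilled unless someone deliberately adds it.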

### What about missing publish dates?

In [24]:
raw_df.dtypes

[('authors', 'array<string>'), ('date_download', 'string'), ('date_modify', 'string'), ('date_publish', 'string'), ('description', 'string'), ('language', 'string'), ('source_domain', 'string'), ('text', 'string'), ('title', 'string'), ('url', 'string'), ('published', 'timestamp')]

In [25]:
no_publish_date = run_sql_pandas('''
    SELECT source_domain, count(*) as no_publish_date
    FROM topic.articles
    WHERE date_publish IS NULL
    GROUP BY source_domain
''')
no_publish_date.set_index('source_domain', inplace=True)
no_publish_date.head(50)

                 no_publish_date
source_domain                   
ft.com           55             
reuters.com      64             
theatlantic.com  694            
nbcnews.com      12393          
wsj.com          142            
slate.com        8              
boston.com       27555          
mediate.com      18             
foxnews.com      1400           
time.com         850            
npr.org          4              
thehill.com      39             
cnbc.com         1

In [26]:
date_issues = all_articles.merge(no_publish_date, how='inner', on='source_domain')
date_issues.head(50)

                 total_articles  no_publish_date
source_domain                                   
ft.com           19717           55             
reuters.com      92              64             
theatlantic.com  38181           694            
nbcnews.com      55438           12393          
wsj.com          490             142            
slate.com        46806           8              
boston.com       38238           27555          
mediate.com      35              18             
foxnews.com      47253           1400           
time.com         47323           850            
npr.org          19924           4              
thehill.com      34763           39             
cnbc.com         15038           1

In [27]:
run_sql_pandas('''
    SELECT url FROM topic.articles
    WHERE source_domain = 'nbcnews.com' AND date_publish IS NULL
''')

                                                                                                                                           url
0      https://www.nbcnews.com/now/video/dr-jill-biden-joe-handled-criticism-with-confidence-and-integrity-65188421623                        
1      https://www.nbcnews.com/video/lemurs-are-the-worlds-most-endangered-mammal-258861635884                                                
2      https://www.nbcnews.com/card/trump-health-care-n727366                                                                                 
3      https://www.nbcnews.com/meet-the-press/video/castro-white-supremacy-is-brewing-throughout-the-country-65287749983                      
4      https://www.nbcnews.com/video/jennifer-cramblett-i-cant-let-them-do-this-to-another-family-336553539944                                
...                                                                                                        ...                                

Because about 72% of boston.com articles (27,555 of 38,238) have no publish date, it is probably not a good publication to use.

nbcnews.com also has a large percentage of articles with no date, but these appear to be particular types of articles, such as those with stand-alone videos.  We can probably still use nbcnews.com, but we will discard articles without a publish date.
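
To compare publications on the same footing, the counts from the date_issues table can be turned into shares. The sketch below does this in plain Python; the 1000-article cutoff for a "large" publication is a hypothetical threshold chosen only for illustration.

```python
# (total_articles, no_publish_date) copied from the date_issues table.
date_issues = {
    "ft.com": (19717, 55),
    "reuters.com": (92, 64),
    "theatlantic.com": (38181, 694),
    "nbcnews.com": (55438, 12393),
    "wsj.com": (490, 142),
    "slate.com": (46806, 8),
    "boston.com": (38238, 27555),
    "mediate.com": (35, 18),
    "foxnews.com": (47253, 1400),
    "time.com": (47323, 850),
    "npr.org": (19924, 4),
    "thehill.com": (34763, 39),
    "cnbc.com": (15038, 1),
}

# Fraction of each publication's articles that lack a publish date.
shares = {domain: missing / total for domain, (total, missing) in date_issues.items()}

# Publications ordered from the largest missing share to the smallest.
ranked = sorted(shares, key=shares.get, reverse=True)

# Hypothetical cutoff: focus on publications with at least 1000 articles.
MIN_ARTICLES = 1000
ranked_large = [d for d in ranked if date_issues[d][0] >= MIN_ARTICLES]
```

Some small publications, such as reuters.com with 92 articles, also have a high missing share, which is why the size filter is applied before drawing conclusions.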

In [28]:
run_sql_pandas('''
    SELECT url, count(*) FROM topic.articles
    GROUP BY url
    HAVING count(*) > 1
''')

                                                                                                                                                                                                             url  count(1)
0  https://www.washingtonpost.com/local/virginia-politics/jens-soering-convicted-of-sensational-1985-double-murders-released-by-virginia-parole-board/2019/11/25/42e2790e-0fc6-11ea-bf62-eadd5d11f559_story.html  2       
1  https://www.npr.org/                                                                                                                                                                                           2       
2  https://slate.com/culture/2010/11/the-nutcracker-3d-screws-up-the-classic-story-in-every-conceivable-way.html                                                                                                  2

We have duplicates that we will need to eliminate in the data cleaning.
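
A minimal sketch of that de-duplication, assuming we keep the first copy seen for each URL, is shown below in plain Python. `drop_duplicate_urls` is a hypothetical helper and the titles are made up. In Spark, `DataFrame.dropDuplicates(["url"])` is the natural equivalent, although which copy it keeps is not something to rely on.

```python
def drop_duplicate_urls(articles):
    """Keep the first article seen for each URL, preserving input order."""
    seen = set()
    unique = []
    for article in articles:
        url = article["url"]
        if url in seen:
            continue
        seen.add(url)
        unique.append(article)
    return unique

# Made-up records for illustration only.
articles = [
    {"url": "https://www.npr.org/", "title": "first copy"},
    {"url": "https://www.npr.org/", "title": "second copy"},
    {"url": "https://example.com/a", "title": "unrelated"},
]
deduped = drop_duplicate_urls(articles)
```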