## Basic analysis of ask_science subreddit data

In this exercise, you’ll learn how to navigate both qualitative and quantitative information. We’ll explore how people engage with the topic of vaccinations on social media by analyzing the r/askscience data from last week and asking targeted questions of our data set. First, we’ll get a better sense of how to handle text-based information by searching this subreddit for submissions that feature the word stem `vaccin` (as in vaccinate and vaccination). Then we’ll compare the engagement metrics (the combined number of comments and upvotes) of the vaccine posts with those of non-vaccine posts.


We'll repeat the step we did last week (exploring the data set) and then go through several further steps (filtering, categorizing, and doing basic math) to answer a question.

### Clarifying Our Research Objective
For this exercise, we’ll use online conversations from the r/askscience subreddit, a popular forum for Reddit users to ask and answer questions related to science, to measure how vigorously vaccinations are discussed on the web.

While Reddit users are not representative of the entire US population, we can try to understand how controversial this topic is on this particular forum by looking at it relative to other topics on the platform. The key here, as in any other examination of the social web, is to acknowledge and understand the specificity of each data set we examine.

We’ll begin by asking a very rudimentary question: do _r/askscience Reddit_ submissions that include variations of the word `vaccination`, `vaccine`, or `vaccinate` elicit more activity than r/askscience subreddit submissions that don’t?

As usual, let's start by importing pandas:


In [1]:
import pandas as pd

Then, let's import our data:

In [2]:
%%time
ask_science_data = pd.read_csv('../data/askscience_submissions.csv')
print(len(ask_science_data))




618576
CPU times: user 6.84 s, sys: 753 ms, total: 7.59 s
Wall time: 7.84 s


Let's look at the first five entries of our data, and then at the last five.

In [3]:
ask_science_data.head()

Unnamed: 0,approved_at_utc,archived,author,author_cakeday,author_flair_css_class,author_flair_text,banned_at_utc,brand_safe,can_gild,can_mod_post,...,subreddit_id,subreddit_type,suggested_sort,thumbnail,thumbnail_height,thumbnail_width,title,ups,url,whitelist_status
0,,True,vertexoflife,,,,,,,,...,t5_2qm4e,,,self,,,Why are so many sodium-based products used in ...,6.0,http://www.reddit.com/r/askscience/comments/1u...,
1,,True,[deleted],,,,,,,,...,t5_2qm4e,,,default,,,Why do almost all living organisms have a long...,1.0,http://www.reddit.com/r/askscience/comments/1u...,
2,,True,SwoccerFields,,,,,,,,...,t5_2qm4e,,,default,,,Can anybody verify this piece of text? I haven...,2.0,http://www.reddit.com/r/askscience/comments/1u...,
3,,True,[deleted],,,,,,,,...,t5_2qm4e,,,default,,,Is it possible for light to travel in a straig...,1.0,http://www.reddit.com/r/askscience/comments/1u...,
4,,True,[deleted],,,,,,,,...,t5_2qm4e,,,default,,,Why is a nuclear bomb more effective detonated...,1.0,http://www.reddit.com/r/askscience/comments/1u...,


In [4]:
ask_science_data.tail()

Unnamed: 0,approved_at_utc,archived,author,author_cakeday,author_flair_css_class,author_flair_text,banned_at_utc,brand_safe,can_gild,can_mod_post,...,subreddit_id,subreddit_type,suggested_sort,thumbnail,thumbnail_height,thumbnail_width,title,ups,url,whitelist_status
618571,,False,abattlescar,,,,,True,,,...,t5_2qm4e,public,,default,,,How long of a lever would be required to give ...,,https://www.reddit.com/r/askscience/comments/7...,all_ads
618572,,False,jbeale53,,,,,True,,,...,t5_2qm4e,public,,default,,,If an alien starship that is about the size of...,,https://www.reddit.com/r/askscience/comments/7...,all_ads
618573,,False,M1k35n4m3,,,,,True,,,...,t5_2qm4e,public,,default,,,Have maps/globes been updated to account for t...,,https://www.reddit.com/r/askscience/comments/7...,all_ads
618574,,False,FalconX88,,,,,True,,,...,t5_2qm4e,public,,default,,,Is it common/possible to smell something that ...,,https://www.reddit.com/r/askscience/comments/7...,all_ads
618575,,False,[deleted],,,,,True,,,...,t5_2qm4e,public,,default,,,Coriolis effect?,,https://www.reddit.com/r/askscience/comments/7...,all_ads


How big is our dataset?

In [5]:
print(len(ask_science_data))

618576


How many columns does it have and what columns are there?

In [6]:
print(len(ask_science_data.columns))
ask_science_data.columns

62


Index(['approved_at_utc', 'archived', 'author', 'author_cakeday',
       'author_flair_css_class', 'author_flair_text', 'banned_at_utc',
       'brand_safe', 'can_gild', 'can_mod_post', 'contest_mode', 'created',
       'created_utc', 'crosspost_parent', 'crosspost_parent_list',
       'distinguished', 'domain', 'edited', 'from', 'from_id', 'from_kind',
       'gilded', 'hidden', 'hide_score', 'id', 'is_crosspostable',
       'is_reddit_media_domain', 'is_self', 'is_video', 'link_flair_css_class',
       'link_flair_text', 'locked', 'media', 'media_embed', 'name',
       'num_comments', 'over_18', 'parent_whitelist_status', 'permalink',
       'pinned', 'post_hint', 'preview', 'quarantine', 'retrieved_on', 'saved',
       'score', 'secure_media', 'secure_media_embed', 'selftext', 'spoiler',
       'stickied', 'subreddit', 'subreddit_id', 'subreddit_type',
       'suggested_sort', 'thumbnail', 'thumbnail_height', 'thumbnail_width',
       'title', 'ups', 'url', 'whitelist_status'],
    

What data types are in it?

In [7]:
ask_science_data.dtypes

approved_at_utc             object
archived                    object
author                      object
author_cakeday              object
author_flair_css_class      object
author_flair_text           object
banned_at_utc               object
brand_safe                  object
can_gild                    object
can_mod_post                object
contest_mode                object
created                     object
created_utc                 object
crosspost_parent            object
crosspost_parent_list       object
distinguished               object
domain                      object
edited                      object
from                       float64
from_id                    float64
from_kind                  float64
gilded                     float64
hidden                      object
hide_score                  object
id                          object
is_crosspostable            object
is_reddit_media_domain      object
is_self                     object
is_video            

### Analysis: Asking the right questions

Let's get back to our research question:
`Do r/askscience Reddit submissions that include variations of the word vaccination, vaccine, or vaccinate elicit more activity than r/askscience subreddit submissions that don’t?`

For that we need to do the following:

1. Filter and group our data into two data frames. The first data frame will contain all submissions that use the words vaccine, vaccinate, or vaccination. The second data frame, which we’ll compare to the first, will contain the submissions that don’t mention those words.
2. Run simple calculations on each data frame. Summarizing our data by finding mean or median engagement counts (in this analysis, engagement counts are the combined number of comments and upvotes) can help us better understand the engagement each subset of the r/askscience data received and formulate an answer to our research question.
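
The two steps above can be sketched on a tiny, made-up data frame before we apply them to the full data set. The titles and numbers below are invented for illustration, and the variable names (`toy`, `vaccine_posts`, `other_posts`) are assumptions, not names used elsewhere in this notebook.

```python
import pandas as pd

# Toy stand-in for the r/askscience data; titles and numbers are made up.
toy = pd.DataFrame({
    "title": [
        "How do vaccines work?",
        "Why is the sky blue?",
        "Should adults get vaccinated again?",
        "How hot is the sun?",
    ],
    "ups": [10, 4, 2, 8],
    "num_comments": [5, 1, 0, 2],
})

# Step 1: flag the rows whose title contains the word stem, then split.
toy["contains_vaccin"] = toy["title"].str.contains("vaccin")
vaccine_posts = toy[toy["contains_vaccin"]]
other_posts = toy[~toy["contains_vaccin"]]

# Step 2: combine comments and upvotes, then summarize each group.
vaccine_engagement = vaccine_posts["ups"] + vaccine_posts["num_comments"]
other_engagement = other_posts["ups"] + other_posts["num_comments"]

print(vaccine_engagement.mean(), vaccine_engagement.median())
print(other_engagement.mean(), other_engagement.median())
```

The rest of this notebook follows the same shape, only with the real data and with a cleaning step in between.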

### Defining your universes

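We can start by looking at posts that contain the word stem `vaccinat`. This allows us to compare vaccination-related posts to all other posts in the same universe of data. Note that this stem matches vaccinate and vaccination but not vaccine; later in the notebook we switch to the shorter stem `vaccin`, which matches all three.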

In [8]:
ask_science_data['vaccinations'] = ask_science_data['title'].str.contains('vaccinat')
ask_science_data['vaccinations'].value_counts()

False    617958
True        600
Name: vaccinations, dtype: int64
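
One thing worth knowing about `str.contains` is that it is case-sensitive by default and that a missing title can produce a missing value (NaN) rather than `False`. The sketch below, on made-up titles, shows the `case` and `na` parameters that change this behaviour. Whether you want a case-insensitive match is a judgment call for your own analysis; the notebook itself uses the defaults.

```python
import pandas as pd

# Made-up titles, including a capitalized match and a missing title.
titles = pd.Series([
    "Vaccines and herd immunity",
    "how does a vaccine work?",
    "Why is the sky blue?",
    None,
])

# Default behaviour: case-sensitive, so "Vaccines" is not matched;
# a missing title can yield a missing value (NaN) rather than False.
strict = titles.str.contains("vaccin")

# case=False ignores capitalization; na=False treats missing titles as non-matches.
relaxed = titles.str.contains("vaccin", case=False, na=False)

print(strict.tolist())
print(relaxed.tolist())
```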

In [9]:
ask_science_vaccination =  ask_science_data[ask_science_data['vaccinations'] == True]

In [10]:
ask_science_vaccination.head()

Unnamed: 0,approved_at_utc,archived,author,author_cakeday,author_flair_css_class,author_flair_text,banned_at_utc,brand_safe,can_gild,can_mod_post,...,subreddit_type,suggested_sort,thumbnail,thumbnail_height,thumbnail_width,title,ups,url,whitelist_status,vaccinations
3865,,True,[deleted],,,,,,,,...,,,self,,,How exactly are vaccines made (by what process...,1.0,http://www.reddit.com/r/askscience/comments/1u...,,True
4199,,True,yupko,,,,,,,,...,,,default,,,"Are there any drawbacks to vaccination at all,...",1.0,http://www.reddit.com/r/askscience/comments/1u...,,True
5095,,True,ducky-box,,,,,,,,...,,,self,,,If a parent decided not to vaccinate their chi...,20.0,http://www.reddit.com/r/askscience/comments/1v...,,True
5327,,True,[deleted],,,,,,,,...,,,default,,,If being vaccinated against the flu is so impo...,1.0,http://www.reddit.com/r/askscience/comments/1v...,,True
5715,,True,Cytosolic,,,,,,,,...,,,default,,,What's a realistic outcome for the current tre...,1.0,http://www.reddit.com/r/askscience/comments/1v...,,True


In [11]:
ask_science_vaccination.sort_values(by='ups', ascending = False)

Unnamed: 0,approved_at_utc,archived,author,author_cakeday,author_flair_css_class,author_flair_text,banned_at_utc,brand_safe,can_gild,can_mod_post,...,subreddit_type,suggested_sort,thumbnail,thumbnail_height,thumbnail_width,title,ups,url,whitelist_status,vaccinations
177230,,True,Schmitty422,,,,,,,,...,,,self,,,I keep hearing about outbreaks of measles and ...,3621.0,http://www.reddit.com/r/askscience/comments/2t...,,True
82615,,True,umichscoots,,,,,,,,...,,,self,,,"If unvaccinated people are causing outbreaks, ...",619.0,http://www.reddit.com/r/askscience/comments/28...,,True
10563,,True,planejane,,,,,,,,...,,,self,,,How are combined vaccinations established? Who...,510.0,http://www.reddit.com/r/askscience/comments/1v...,,True
174803,,True,andyjeff76,,,,,,,,...,,,self,,,Is the rise in Measles cases the result of the...,387.0,http://www.reddit.com/r/askscience/comments/2s...,,True
87383,,True,werd678,,,,,,,,...,,,self,,,Could you acquire a vaccination through a bloo...,304.0,http://www.reddit.com/r/askscience/comments/29...,,True
159710,,True,rsh412,,,,,,,,...,,,self,,,How do diseases like mumps spread through vacc...,160.0,http://www.reddit.com/r/askscience/comments/2p...,,True
413930,,False,majorlazed,,,,,,,,...,,,self,,,Why are Polio and Smallpox eradicated while ot...,75.0,https://www.reddit.com/r/askscience/comments/4...,,True
10326,,True,daats_end,,,,,,,,...,,,self,,,Do unvaccinated children really run a signific...,64.0,http://www.reddit.com/r/askscience/comments/1v...,,True
85179,,True,[deleted],,,,,,,,...,,,default,,,Why is not vaccinating some children a threat ...,57.0,http://www.reddit.com/r/askscience/comments/29...,,True
220286,,False,InteriorEmotion,,,,,,,,...,,,self,,,If antibodies are transferred from mother to c...,52.0,http://www.reddit.com/r/askscience/comments/33...,,True


In [12]:
ask_science_data.iloc[4]


approved_at_utc                                                          NaN
archived                                                                True
author                                                             [deleted]
author_cakeday                                                           NaN
author_flair_css_class                                                   NaN
author_flair_text                                                        NaN
banned_at_utc                                                            NaN
brand_safe                                                               NaN
can_gild                                                                 NaN
can_mod_post                                                             NaN
contest_mode                                                             NaN
created                                                         1388535304.0
created_utc                                                      1.38854e+09

In [13]:
ask_science_data['title']

0         Why are so many sodium-based products used in ...
1         Why do almost all living organisms have a long...
2         Can anybody verify this piece of text? I haven...
3         Is it possible for light to travel in a straig...
4         Why is a nuclear bomb more effective detonated...
5         Can the different types of tastes be mixed tog...
6         What is happening in my brain when I accidenta...
7                         Can I grow a Carrot in the ocean?
8         How much better/worse is our brain compared to...
9                 What's the smallest thing an eye can see?
10        Can anybody verify this piece of text? I haven...
11        What are the chances of an Alien species speak...
12        Is there any medical treatment that we know wo...
13                      How does inbreeding affect animals?
14        It is generally accepted that one should be hy...
15        What are the opinions of Clinical Research Org...
16        Can someone explain why this n

### Answering your question by doing simple math

Let's:
- narrow down our data
- clean our data to drop NA values

In [14]:
columns = ['title','ups', 'num_comments']
ask_science_reduced = ask_science_data[columns]


In [15]:
ask_science_reduced.head()

Unnamed: 0,title,ups,num_comments
0,Why are so many sodium-based products used in ...,6.0,2.0
1,Why do almost all living organisms have a long...,1.0,0.0
2,Can anybody verify this piece of text? I haven...,2.0,0.0
3,Is it possible for light to travel in a straig...,1.0,0.0
4,Why is a nuclear bomb more effective detonated...,1.0,0.0


In [16]:
len(ask_science_reduced)

618576

### To drop or not to drop null values, that is the question

Deciding whether to remove null values or fill them depends on your data set and how you want to answer your research question. For example, if we wanted to get the median number of comments for our entire data set, we might ask whether it’s safe to assume that missing values simply mean that there were no comments on the submission. If we decide it is, we can fill those values with a 0 and calculate accordingly.

Depending on the number of rows that contain missing values or “empty cells,” the median number of comments may shift significantly. However, because this data set does sometimes record the number of comments or upvotes as zeros and sometimes as null values, we can’t automatically assume that rows that contain null values for those columns should be treated as zeros (if null values represented zeros, it may be reasonable to assume that the data set would not contain any actual zeros). Instead, maybe this is data that our archivist was unable to capture; maybe the posts were deleted before he could gather that information; or maybe those metrics were introduced for some of those years but not for others. Thus, for the sake of our exercise, we should work with the data we do have and drop the rows of data that do not contain a value for the ups or the num_comments columns.
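
To see why this choice matters, here is a small sketch on made-up comment counts that contain both real zeros and missing values. It counts the missing values, then compares the median after filling them with 0 against the median after dropping them. The numbers are invented purely to make the shift visible.

```python
import pandas as pd

# Made-up comment counts: real zeros sit alongside missing values.
comments = pd.Series([0, 0, 3, None, None, None, 12], name="num_comments")

# How many values are missing?
print(comments.isna().sum())

# Option 1: assume a missing value means "no comments" and fill with 0.
filled_median = comments.fillna(0).median()
print(filled_median)

# Option 2: keep only the rows where a value was actually recorded.
dropped_median = comments.dropna().median()
print(dropped_median)
```

The two medians differ, so the decision to fill or drop is part of the analysis and worth stating explicitly.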

Let's drop some of the NaN values like we did last week!

In [17]:
ask_science_dropped_rows = ask_science_reduced.dropna(subset=['ups', 'num_comments'])
len(ask_science_dropped_rows)

478260

In [18]:
%%time
ask_science_dropped_rows['contains_vaccin'] = ask_science_dropped_rows['title'].str.contains('vaccin')

CPU times: user 294 ms, sys: 5.45 ms, total: 300 ms
Wall time: 310 ms


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


Let's create two data frames:
- one that contains vaccination related posts
- one that contains all other posts

In [19]:
ask_science_data_vaccinations = ask_science_dropped_rows[ask_science_dropped_rows['contains_vaccin'] == True]
ask_science_data_no_vaccinations = ask_science_dropped_rows[ask_science_dropped_rows['contains_vaccin'] == False]


Let's add up the reactions to find an overall tally of activity by which we measure the 'popularity' of a post.

In [20]:
%%time
ask_science_data_vaccinations['combined_reactions'] = ask_science_data_vaccinations['ups']+ask_science_data_vaccinations['num_comments']
ask_science_data_no_vaccinations['combined_reactions'] = ask_science_data_no_vaccinations['ups']+ask_science_data_no_vaccinations['num_comments']

CPU times: user 40.9 ms, sys: 2.04 ms, total: 43 ms
Wall time: 44 ms


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
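
The `SettingWithCopyWarning` messages printed above appear because we add a column to a data frame that was itself produced by filtering another one, and pandas cannot tell whether we meant to modify an independent object. The results in this notebook are still computed, but one common way to avoid the ambiguity is to call `.copy()` on the filtered result before adding columns. The following is a minimal sketch on made-up rows, with hypothetical variable names, rather than a rewrite of the cells above.

```python
import pandas as pd

# Made-up rows standing in for the reduced data frame.
reduced = pd.DataFrame({
    "title": [
        "How do vaccines work?",
        "Why is the sky blue?",
        "Is ice slippery?",
        "Are vaccinations safe?",
    ],
    "ups": [10.0, None, 3.0, 7.0],
    "num_comments": [5.0, 1.0, 0.0, 2.0],
})

# .copy() makes the result an independent data frame, so adding a
# column to it is unambiguous and leaves `reduced` untouched.
dropped = reduced.dropna(subset=["ups", "num_comments"]).copy()
dropped["contains_vaccin"] = dropped["title"].str.contains("vaccin")

# The same applies when splitting off a subset by a boolean filter.
vaccine_posts = dropped[dropped["contains_vaccin"]].copy()
vaccine_posts["combined_reactions"] = (
    vaccine_posts["ups"] + vaccine_posts["num_comments"]
)

print(vaccine_posts)
```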
  


Now we can sort the data frames by number of combined reactions. 

In [21]:
ask_science_data_vaccinations.sort_values(by='combined_reactions', ascending=False)


Unnamed: 0,title,ups,num_comments,contains_vaccin,combined_reactions
177230,I keep hearing about outbreaks of measles and ...,3621.0,662.0,True,4283.0
133287,Why are we afraid of making super bugs with an...,2377.0,548.0,True,2925.0
143663,"Psychologically speaking, how can a person con...",1775.0,461.0,True,2236.0
82615,"If unvaccinated people are causing outbreaks, ...",619.0,146.0,True,765.0
10563,How are combined vaccinations established? Who...,510.0,56.0,True,566.0
174803,Is the rise in Measles cases the result of the...,387.0,58.0,True,445.0
180101,Is Mercury all that bad for you? Why is it pre...,312.0,78.0,True,390.0
87383,Could you acquire a vaccination through a bloo...,304.0,55.0,True,359.0
412308,Why are vaccines mostly limited to providing i...,199.0,58.0,True,257.0
475068,"How are infectious organisms ""weakened"" for li...",210.0,31.0,True,241.0


In [22]:
ask_science_data_no_vaccinations.sort_values(by='combined_reactions', ascending=False)

Unnamed: 0,title,ups,num_comments,contains_vaccin,combined_reactions
477896,"If we could drain the ocean, could we breath o...",18789.0,1018.0,False,19807.0
456459,If we detonated large enough of a nuclear bomb...,11690.0,1353.0,False,13043.0
461793,"In terms of a percentage, how much oil is left...",9305.0,1624.0,False,10929.0
457979,How do you optimally place two or more Hot Poc...,9308.0,936.0,False,10244.0
457573,Carbon in all forests is 638 GtC. Annual carbo...,8856.0,834.0,False,9690.0
350040,Gravitational Wave Megathread,6778.0,2799.0,False,9577.0
455717,"Why do flames take a clearly defined form, rat...",9109.0,344.0,False,9453.0
478215,If my voice sounds different to me than it doe...,8657.0,511.0,False,9168.0
471827,If fire is a reaction limited to planets with ...,8099.0,876.0,False,8975.0
465628,In this gif of white blood cells attacking a p...,8155.0,639.0,False,8794.0


How about doing simple math with our data frames?

In [23]:
print(ask_science_data_vaccinations['combined_reactions'].describe())
print(ask_science_data_no_vaccinations['combined_reactions'].describe())

count    1272.000000
mean       13.723270
std       162.056708
min         0.000000
25%         1.000000
50%         1.000000
75%         2.000000
max      4283.000000
Name: combined_reactions, dtype: float64
count    476988.000000
mean         16.585008
std         197.908268
min           0.000000
25%           1.000000
50%           1.000000
75%           2.000000
max       19807.000000
Name: combined_reactions, dtype: float64


In [24]:
print(ask_science_data_vaccinations['combined_reactions'].mean())
print(ask_science_data_no_vaccinations['combined_reactions'].mean())

13.723270440251572
16.58500842788498


In [25]:
print(ask_science_data_vaccinations['combined_reactions'].median())
print(ask_science_data_no_vaccinations['combined_reactions'].median())

1.0
1.0
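
The output above shows means well above the medians in both groups, which suggests that a few very popular posts pull the mean upward. When comparing groups it can be convenient to compute several summaries side by side. The sketch below does this with `groupby` and `agg` on made-up numbers; the `group` column and its labels are assumptions introduced for illustration.

```python
import pandas as pd

# Made-up posts: most get little engagement, one gets a lot.
posts = pd.DataFrame({
    "contains_vaccin": [True, True, False, False, False],
    "ups": [1.0, 20.0, 1.0, 2.0, 300.0],
    "num_comments": [0.0, 9.0, 0.0, 1.0, 50.0],
})
posts["combined_reactions"] = posts["ups"] + posts["num_comments"]

# Readable group labels (hypothetical names, for display only).
posts["group"] = posts["contains_vaccin"].map({True: "vaccine", False: "other"})

# Count, mean, and median of combined reactions for each group in one table.
summary = posts.groupby("group")["combined_reactions"].agg(
    ["count", "mean", "median"]
)
print(summary)
```

In this toy data the single outlier lifts the mean of the "other" group far above its median, which is why looking at both statistics gives a fuller picture than either one alone.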
