# Programming Assignment 1 - Understanding upvotes on top reddit posts

After completing this project, you will be able to do the following:

- Collect and save reddit data using the reddit API (through the ```praw``` library)
- Conduct descriptive analyses of the number of upvotes of reddit posts, both by manipulating data stored in a pandas dataframe and by creating and exploring graphs
- Conduct a linear regression to help understand the factors associated with a top post having many upvotes on reddit
- **574 Only**: Implement additional feature sets and/or a new model and describe why those decisions were made and what their effects were on performance

# Resources you can use to complete this assignment (a COMPLETE list)

**NOTE: You ARE allowed to use Google to find things that fit this list (i.e. it is often easy to google something like "plotly draw line graph" to find the right part of the plotly documentation).**

- Anything linked to in this article
- Anything linked to from the course web page
- Any materials from another online course taught at a university (**if you use this, you MUST provide a link to the exact document used**)
- Anything posted by Kenny, Navid, or Yincheng on Piazza

# Setup

- For this assignment, you will need to install the PRAW library for scraping reddit data.

# Grading

There are three parts to the grading:

1. **Written Report (60 points)**: You will submit a PDF report that provides answers to the questions here and that contains the plots we request. These same questions are also posted in the assignment PDF, for convenience. **Again, the questions in the Assignment PDF and here are the same (for the written report); we just put them in both places for convenience.**

2. **Saved file from Part 1.3 (10 points)**: See below for details. 

3. **Coding spot checks (30 points)** - We will select 6 problems to spot check. This means that we will check to make sure that your code is written in a reasonable way and that it obtains the desired results when we run your code. For example, your code should not be written in a way that makes it exceedingly slow, e.g. by using for loops where a vectorized approach would be applicable. We will *not* tell you which problems we are spot checking.


As such, one member of your group will submit, on UBLearns, a ```.zip``` file that contains 3 things:
- Your completed jupyter notebook.
- Your written report, answering all questions asked here (and copied in the assignment PDF)
- Your saved file from Part 1.3 below

In [3]:
# This is a comment in the code. All comments in python are preceded by a pound sign
# Comments can be plain English, because the computer ignores them when running the code.

# This should be all the imports you need for this project.

# The line of code below this comment imports code written by other people in the form of the 
# praw library
import praw
import numpy as np
import pandas as pd
import plotly.express as px
import pprint

# Part 1: Data collection

## Step 1: Creating a reddit account

If you don't have one already, the first thing you'll need to do is go to [reddit](http://www.reddit.com/) and create a reddit account.

## Step 2: Creating a reddit app

Now, we're going to create a reddit app. Make sure you're signed in to your reddit account, and then go to the [app page](https://ssl.reddit.com/prefs/apps/).  From here, click on the "create an app" button. <b>Make sure that you've selected the "script" option in the checkbox</b>, and then fill in a name and description. For the two URL fields, it doesn't really matter what you put. Your input should look something like mine.

<img width="1000px" src="./reddit_tut_0.png" style="max-width:95%;border:3px solid black;"></img>
    
Cool! Now hit the "create app" button.

## Step 3: Scraping some reddit data

### 3a - setting up your API credentials
We're now going to pull down some real, live reddit posts and take a look at them! To do so, we're going to need some information from our app.  Reddit uses this information to keep track of who is accessing their data, and in what ways they're accessing it.  In other words, reddit wants to make sure it knows who might be doing bad things with their data or to their platform, so they can shut those apps down. We'll return to that point in a bit.  But, for now, you'll need to edit the code below to enter in your ```client_id```, your ```client_secret```, and your ```username```  in order for you to be able to collect data.  ***Note - PLEASE TRY TO REMEMBER TO REMOVE YOUR SECRET INFORMATION BEFORE SUBMITTING THE ASSIGNMENT***. You can find the first two on the app creation page, where the red boxes are in the image below:

<img width="1000px" src="reddit_tut_1.png" style="max-width:95%;border:3px solid black;"></img>

Enter your information between the appropriate quotation marks in the (python) code below, replacing the phrase ```ENTER_YOUR_XXX_HERE```.

 ***Make sure to push ```SHIFT+ENTER``` after you've changed the code!*** 

In [4]:
your_client_id = 'CLIENT_ID'
your_client_secret = 'CLIENT_SECRET'
your_username = 'USER_NAME'
your_user_agent = "USER_AGENT"
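One way to make it harder to accidentally submit your secret information is to keep it out of the notebook entirely and read it from environment variables. The snippet below is a minimal sketch of that idea, not a required part of the assignment; the function name and the environment variable names (```REDDIT_CLIENT_ID``` and so on) are hypothetical and can be anything you like.

```python
import os


def load_reddit_credentials(env=None):
    """Read reddit app credentials from environment variables.

    The variable names used here are hypothetical, chosen for illustration.
    Missing variables come back as empty strings.
    """
    env = os.environ if env is None else env
    return {
        "client_id": env.get("REDDIT_CLIENT_ID", ""),
        "client_secret": env.get("REDDIT_CLIENT_SECRET", ""),
        "username": env.get("REDDIT_USERNAME", ""),
        "user_agent": env.get("REDDIT_USER_AGENT", ""),
    }


# Demonstration with a stand-in dictionary instead of the real environment
creds = load_reddit_credentials({"REDDIT_CLIENT_ID": "abc123"})
print(creds["client_id"])
```

With this approach, the values in the notebook would be filled in from ```load_reddit_credentials()``` rather than typed between the quotation marks.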

### 3b - Selecting subreddits

Ok, we're almost all set up to scrape! To do so, we're going to use a python library called [praw](https://praw.readthedocs.io/).  Praw is a relatively powerful tool, allowing you to do a bunch of cool things with the reddit API.  For this assignment, you're just going to do the basics.

Specifically, **<span style="color: red;">create a variable called ```subreddits``` in the code box below. The variable should point to a ```list``` data structure that has the names of 3 subreddits you want to pull data from.</span>**

In [6]:
subreddits = ['UBreddit','buffalobills','Buffalo']

### 3c - Setting up our authentication mechanism for our application

The last step before we start crawling is to set ourselves up to make authenticated calls to the reddit API. **<span style="color: red;">Use the PRAW library to create an instance of the class ```praw.Reddit``` that you can use to scrape the reddit API.</span>**


In [7]:
# Code for 3c should go here
reddit = praw.Reddit(client_id=your_client_id, user_agent = your_user_agent, client_secret = your_client_secret)

### 3d - Finally, some scraping!

OK! Now we can finally pull some data down from the reddit API!

<span style="color: red;">Use the ```praw``` library to pull down the **top 1000 posts of all time from EACH of the 3 subreddits you selected.** Note: You may not get all 1000, due to oddities with the reddit API. However, your code should specify that it *wants* up to 1000 posts.</span>

In [8]:
# Code for 3d should go here
top1000 = {}
for subreddit in subreddits:
    top1000[subreddit] = reddit.subreddit(subreddit).top(time_filter='all',limit=1000)

## Answering some questions about your data and the API

### 1.1 Understanding APIs
***Note, Part 1 questions can be answered by carefully reading the [documentation of the PRAW library](https://praw.readthedocs.io/en/v3.6.2/pages/getting_started.html) and/or the [reddit API documentation](https://github.com/reddit-archive/reddit/wiki/API#rules).***

- **1.1.1** How many API calls were required to collect the submissions?
            Reddit returns at most 100 items per API call. Since we are requesting 1000 * 3 items, 30 API calls will be made.
- **1.1.2** Why did we set the submission limit at 1000?
            Any single reddit listing will display at most 1000 items. This limit keeps the load on reddit's databases manageable.
- **1.1.3** How long, in minutes, would it take you to collect 1000 posts from 25 different subreddits? What about from 500 different subreddits? 
            To collect 1000 posts in a single subreddit, assuming each API call takes a second, it takes almost 30 seconds (max(api_delay, code_execution_time)). Thus for 25 subreddits, the API takes 12.5 minutes. For 500 subreddits, the API takes 250 minutes.  
*Hint: You'll have to consider how many API requests you are allowed to make in a given time period.*
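The arithmetic behind 1.1.1 and 1.1.3 can be sketched as a small helper. This is a rough estimate under stated assumptions: 100 items per request (as in the answer to 1.1.1), and the rate limit as the only bottleneck. The default of 60 requests per minute is an assumption for illustration; replace it with whatever limit the documentation you are reading specifies. The function name is hypothetical.

```python
import math


def estimate_collection(n_subreddits, posts_per_subreddit=1000,
                        items_per_request=100, requests_per_minute=60):
    """Return (total API requests, estimated minutes).

    Assumes the rate limit is the only bottleneck. The default
    requests_per_minute is an illustrative assumption, not a documented value.
    """
    requests_per_subreddit = math.ceil(posts_per_subreddit / items_per_request)
    total_requests = n_subreddits * requests_per_subreddit
    return total_requests, total_requests / requests_per_minute


requests_3, minutes_3 = estimate_collection(3)
requests_25, minutes_25 = estimate_collection(25)
requests_500, minutes_500 = estimate_collection(500)
print(requests_3, requests_25, requests_500)
```

Changing ```requests_per_minute``` changes the time estimates but not the request counts, which depend only on the page size.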

### 1.2 Thinking about your sample

You collected (approximately) the top 1000 submissions from 3 different subreddits. 

- **1.2.1** Do you think these posts are representative of **all** the posts on that subreddit? (Yes or no, only) 
            No
- **1.2.2** Why or why not? That is, if you think so, why do you think there's not much sampling bias here? If not, what do you think might be different about these top posts than other posts?
            Top posts are selected by upvotes, so they likely over-represent popular redditors and popular opinions. Unpopular opinions have a much lower probability of being sampled, which results in a sampling bias.



## Saving out your data

Finally, we're going to save your data out and submit it. For this part, [this section of the API documentation may be useful](https://praw.readthedocs.io/en/latest/getting_started/quick_start.html#determine-available-attributes-of-an-object). Similarly, note that saving CSVs is sometimes easiest by first converting your data into a ```pandas``` dataframe, and then just calling ```.to_csv()```.

<span style="color: red;">You will save the data you have collected out to a CSV file. This CSV file should be called ```part1_data.csv```. The file should contain a column for each of the fields listed in the ```fields_to_capture``` list below. Additionally, you should save out the **author's name** (hint: the author attribute of the ```praw.Submission``` data structure is a ```praw.Redditor``` object. You will need to access that object to get the author name). Call this column ```author_name``` in your CSV file. </span>

**Note: Some posts will not have data for some of these columns. That is fine! You can make these fields blank in the CSV, then.**

In [10]:
# Don't change this!
fields_to_capture = [ 'created_utc', 
                     'is_crosspostable', 'is_self', 'is_video', 'locked', 'media_only', 'over_18',
                     'subreddit_id', 'subreddit_name_prefixed', 'subreddit_subscribers', 
                     'title', 'permalink', 
                     'total_awards_received', 'downs','gilded','num_comments', 'num_crossposts', 'num_reports', 
                     'ups']
fields_to_capture.append('author_name')

In [11]:
# Write the code here to save out the data
rows = []
for s in subreddits:
    for t in top1000[s]:
        # Copy each requested attribute straight off the submission object
        row = {field: getattr(t, field) for field in fields_to_capture if field != 'author_name'}
        # Some posts have no author object (e.g. deleted accounts); leave the name blank then
        row['author_name'] = t.author.name if t.author else ""
        rows.append(row)

# Build the dataframe with columns in the same order as fields_to_capture
dframe = pd.DataFrame(rows, columns=fields_to_capture)

dframe.to_csv('part1_data.csv', index=False)

It appears that you are using PRAW in an asynchronous environment.
It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.

### 1.3 Grading for saving data

Submit your saved data from the reddit API in a file named ```part1_data.csv```. For grading for Part 1.3, we will check that:
- We can read in the saved file using ```pandas.read_csv()```
- The resulting file has data from three subreddits, approximately 1000 from each (give or take what the API decides to give you, which is out of your control).
- The resulting file has all the necessary columns, i.e. those listed in ```fields_to_capture```.
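Before submitting, it can be worth running the same kind of checks on your own file. The sketch below is one possible self-check, not the grading script; the function name ```check_saved_csv``` is hypothetical, and the toy CSV stands in for ```part1_data.csv```.

```python
import io

import pandas as pd


def check_saved_csv(source, required_columns):
    """Read a CSV and summarize the properties listed above."""
    df = pd.read_csv(source)
    missing = [c for c in required_columns if c not in df.columns]
    counts = df["subreddit_name_prefixed"].value_counts()
    return {
        "missing_columns": missing,
        "n_subreddits": len(counts),
        "posts_per_subreddit": counts.to_dict(),
    }


# Toy stand-in for the real file; note the blank author_name in the second row
toy_csv = io.StringIO(
    "subreddit_name_prefixed,ups,author_name\n"
    "r/a,10,x\n"
    "r/a,5,\n"
    "r/b,7,y\n"
    "r/c,1,z\n"
)
report = check_saved_csv(toy_csv, ["subreddit_name_prefixed", "ups", "author_name"])
print(report)
```

For the real file, you would pass ```'part1_data.csv'``` and ```fields_to_capture``` instead of the toy inputs.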


# Part 2 - Analyzing an existing dataset

For this section of the assignment, the entire class will use ```part2_data.csv```.  We will ask you to analyze these data in a variety of ways. Parts of this will be submitted in your written report, and other parts will be evaluated automatically.

**Part 2 data consists of the top 1000 (ish) posts from 24 different subreddits**.

In [12]:
part2_data = pd.read_csv("./part2_data.csv")
# part2_data.subreddit_name_prefixed.unique()
part2_data['subreddit_name_prefixed']
r_df = part2_data[part2_data['subreddit_name_prefixed'].isin(['r/Jokes'])]
r_df['ups'].mean()

41057.7813440321

## Part 2.1 - Quick Descriptive Analyses

Answers to each of the questions below should be provided in your written report. Additionally, we expect code to be written below that shows how you obtained answers to each of these questions. We will spot-check several of these. **Note: each of these can be answered using *only pandas*.**


### Univariate descriptive analyses
- **2.1.1** What are the names (```subreddit_name_prefixed```) of the 24 different subreddits that are in ```part2_data.csv```?
            'r/Jokes', 'r/news', 'r/science', 'r/WritingPrompts',
            'r/Showerthoughts', 'r/worldnews', 'r/todayilearned',
            'r/learnprogramming', 'r/announcements', 'r/funny', 'r/food',
            'r/sports', 'r/gadgets', 'r/aww', 'r/mildlyinteresting', 'r/memes',
            'r/technology', 'r/travel', 'r/books', 'r/gaming', 'r/cats',
            'r/conspiracy', 'r/PoliticalHumor', 'r/hockey'
       
- **2.1.2** How many reddit authors (```author_name```) have a post in more than one unique subreddit in ```part2_data.csv``` (e.g. they have a top post in both ```r/news``` and ```r/hockey```)?
            569

- **2.1.3** What is the mean number of upvotes (```ups```) for posts in ```r/Jokes```?
            41057.7813440321
- **2.1.4** What is the variance of the number of upvotes in ```r/news```?
            600707867.6203129
- **2.1.5** What is the standard deviation of the number of upvotes received across the entire dataset?
            43102.48447371037
- **2.1.6** (No code for this) Mathematically, what is the relationship between the standard deviation of the number of upvotes and the variance of upvotes?
            Standard deviation is the square root of the variance.
- **2.1.7** Which subreddit had the third highest median number of upvotes?
            r/aww                  115785.273092
### Conditional probability
- **2.1.8** What is the conditional probability of an author having a top post in ```r/news```, given that they have a top post in ```r/worldnews```?
          0.0028735632183908046


In [13]:
# Put your code for 2.1.1 here
part2_data['subreddit_name_prefixed'].unique()

array(['r/Jokes', 'r/news', 'r/science', 'r/WritingPrompts',
       'r/Showerthoughts', 'r/worldnews', 'r/todayilearned',
       'r/learnprogramming', 'r/announcements', 'r/funny', 'r/food',
       'r/sports', 'r/gadgets', 'r/aww', 'r/mildlyinteresting', 'r/memes',
       'r/technology', 'r/travel', 'r/books', 'r/gaming', 'r/cats',
       'r/conspiracy', 'r/PoliticalHumor', 'r/hockey'], dtype=object)

In [14]:
# Put your code for 2.1.2 here
# Count the distinct subreddits each author has posted in (vectorized, no Python loop)
subreddits_per_author = part2_data.groupby('author_name')['subreddit_name_prefixed'].nunique()
# Number of authors who appear in more than one subreddit
(subreddits_per_author > 1).sum()

570

In [15]:
# Put your code for 2.1.3 here
r_df = part2_data[part2_data['subreddit_name_prefixed'].isin(['r/Jokes'])]
r_df['ups'].mean()

41057.7813440321

In [16]:
# Put your code for 2.1.4 here
r_df = part2_data[part2_data['subreddit_name_prefixed'].isin(['r/news'])]
r_df['ups'].var()

600707867.6203133

In [17]:
# Put your code for 2.1.5 here
part2_data['ups'].std()

43102.4844737104

In [19]:
# Put your code for 2.1.6 here


In [20]:
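# Put your code for 2.1.7 here
# Note: this ranks subreddits by their mean upvotes; the question asks about the median (.median())
part2_data.groupby('subreddit_name_prefixed')['ups'].mean().sort_values(ascending = False)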

subreddit_name_prefixed
r/memes                139779.792339
r/funny                117773.470588
r/aww                  115785.273092
r/gaming               104358.502513
r/worldnews             88161.702213
r/todayilearned         87615.530550
r/news                  84578.367203
r/mildlyinteresting     83613.835341
r/Showerthoughts        72734.319459
r/PoliticalHumor        57198.236657
r/science               55911.997982
r/technology            48686.478261
r/sports                43736.108519
r/Jokes                 41057.781344
r/food                  27463.039275
r/cats                  26915.017034
r/books                 17782.996982
r/announcements         17625.624242
r/gadgets               12904.951170
r/WritingPrompts        12688.698699
r/hockey                10553.425000
r/conspiracy            10405.773509
r/travel                 8215.184184
r/learnprogramming       1789.061245
Name: ups, dtype: float64
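Question 2.1.7 asks about the median, while the table above is sorted by the mean, and the two rankings can disagree when a subreddit has a few very large posts. Here is a minimal sketch on made-up data (the subreddit names and values are invented for illustration) showing how a median ranking could be computed:

```python
import pandas as pd

# Toy data: r/a has one huge outlier that inflates its mean but not its median
toy = pd.DataFrame({
    "subreddit_name_prefixed": ["r/a"] * 3 + ["r/b"] * 3 + ["r/c"] * 3 + ["r/d"] * 3,
    "ups": [1, 2, 300, 10, 20, 30, 5, 6, 7, 50, 60, 70],
})

by_group = toy.groupby("subreddit_name_prefixed")["ups"]
medians = by_group.median().sort_values(ascending=False)
means = by_group.mean().sort_values(ascending=False)

# Index position 2 is the third-highest entry
third_by_median = medians.index[2]
third_by_mean = means.index[2]
print(third_by_median, third_by_mean)
```

On the assignment data, the same pattern would apply to ```part2_data``` with ```.median()``` in place of ```.mean()```.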

In [21]:
#Code for 2.1.8
#Get the number of posts in r/news
df_news = part2_data[part2_data['subreddit_name_prefixed'].isin(['r/news'])]
news_count = df_news['subreddit_name_prefixed'].count()

#Get the number of posts in r/worldnews
df_world_news = part2_data[part2_data['subreddit_name_prefixed'].isin(['r/worldnews'])]
world_news_count = df_world_news['subreddit_name_prefixed'].count()

df_news_and_worldnews = part2_data.groupby('author_name')['subreddit_name_prefixed'].unique() 
news_and_worldnews_count = 0
for n in df_news_and_worldnews:
    if ('r/news') in n and ('r/worldnews') in n:
        news_and_worldnews_count+=1
total_entries = part2_data.subreddit_name_prefixed.count()
p_news = np.divide(news_count,total_entries)
p_world_news = np.divide(world_news_count,total_entries)
p_news_and_worldnews = np.divide(news_and_worldnews_count, total_entries)
p_news_given_worldnews = np.divide(np.multiply(p_news,p_news_and_worldnews),p_world_news)
print(p_news_given_worldnews)

0.0028735632183908046
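As a reminder, the definition of conditional probability is P(A | B) = P(A and B) / P(B). Because 2.1.8 is a question about authors, one reasonable reading is to count distinct authors: the number of authors with a top post in both subreddits, divided by the number of authors with a top post in ```r/worldnews```. The sketch below illustrates that reading on made-up data (the author names are invented); it is one interpretation, not the official solution.

```python
import pandas as pd

toy = pd.DataFrame({
    "author_name": ["ann", "ann", "bob", "cat", "cat", "dan"],
    "subreddit_name_prefixed": [
        "r/news", "r/worldnews", "r/worldnews", "r/news", "r/worldnews", "r/news",
    ],
})

# Distinct authors in each subreddit
news_authors = set(toy.loc[toy["subreddit_name_prefixed"] == "r/news", "author_name"])
worldnews_authors = set(toy.loc[toy["subreddit_name_prefixed"] == "r/worldnews", "author_name"])

# P(news | worldnews) = |authors in both| / |authors in worldnews|
p_news_given_worldnews = len(news_authors & worldnews_authors) / len(worldnews_authors)
print(p_news_given_worldnews)
```

Note that the denominator is the conditioning event only; the probability of ```r/news``` on its own does not enter the formula.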


## Part 2.2 - Plotting and the like
Where we have asked you to create a plot below, make sure to provide the resulting plot in your written report.  

**You are free to use whatever plotting software you wish! Although I personally think [seaborn](https://seaborn.pydata.org/), [plotly](https://plotly.com/python), or [altair](https://altair-viz.github.io/) will make these the easiest, the lecture notes also have examples using matplotlib.**


### Histograms

Plot a histogram for the distribution of upvotes for each subreddit separately (*hint: you will want to use "faceting" to make this easy on yourself*). **All plot titles and axis labels should be legible in the PDF you submit**.

- **2.2.1** - Submit your histogram image in your assignment
- **2.2.2** - Based on your histogram, which subreddit would you say is the *least* popular? (Note, there is more than one reasonable answer here. We are looking mostly for how you justify your response using the histogram)

            r/announcements has the fewest upvotes and hence can be said to be the least popular subreddit.


In [22]:
# Code for 2.2.1 here
import plotly.express as px
fig = px.histogram(part2_data, x='ups', facet_col = 'subreddit_name_prefixed', facet_col_wrap = 5, width=1080, height=720)
fig.for_each_annotation(lambda a: a.update(text=a.text.split("=")[1]))
fig.show()


### Plotting and using the empirical CDF

The *[empirical cumulative distribution function (eCDF)](https://en.wikipedia.org/wiki/Empirical_distribution_function)* is an empirical estimator for the CDF of a random variable. Below we have plotted for you (using ```plotly```) the eCDFs of the distribution of upvotes for three different subreddits. Using the plots below, answer the following questions:

**Note, you can use your mouse to scroll over the information in the plot, that will make answering these questions much easier!**

- **2.2.3** - **Approximately (within 1-2 percentage points)** what percent of top posts for each of the three subreddits plotted below have less than 100,000 upvotes? (Give answers for each subreddit)
                r/news - 84.5%
                r/worldnews - 98.5% 
                r/science - 78.7%
- **2.2.4** - **Approximately (within 1-2 percentage points)** what is the probability that a post on each of the three subreddits plotted below has more than 70,000 upvotes? (Give answers for each subreddit)
                r/news - 73.2%
                r/worldnews - 12.1% 
                r/science - 96.7%

In [23]:
import plotly.express as px

fig = px.ecdf(part2_data[part2_data['subreddit_name_prefixed'].isin(["r/news","r/worldnews","r/science"])], 
              x="ups",
              facet_col='subreddit_name_prefixed',
             height=400,width=800)
fig.for_each_annotation(lambda a: a.update(text=a.text.split("=")[-1]))
fig.update_xaxes(matches=None)
fig.show()
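The values you read off an eCDF plot can also be cross-checked numerically: the eCDF at a threshold is just the fraction of posts at or below it, and the probability of exceeding a threshold is one minus that. A minimal sketch on made-up data (subreddit names and upvote counts are invented):

```python
import pandas as pd

toy = pd.DataFrame({
    "subreddit_name_prefixed": ["r/a"] * 4 + ["r/b"] * 4,
    "ups": [50_000, 80_000, 120_000, 90_000, 10_000, 20_000, 150_000, 60_000],
})

groups = toy["subreddit_name_prefixed"]

# Fraction of posts with fewer than 100,000 upvotes, per subreddit
below_100k = toy["ups"].lt(100_000).groupby(groups).mean()

# Fraction of posts with more than 70,000 upvotes, per subreddit
above_70k = toy["ups"].gt(70_000).groupby(groups).mean()

print(below_100k)
print(above_70k)
```

Taking the mean of a boolean column gives the proportion of ```True``` values, which is why no explicit counting is needed.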

### Temporal Trends 

To answer this question, we are going to plot the average number of upvotes and the number of top posts for each subreddit in our dataset in each year.

First, add a ```year``` column to the data, that represents the year in which the post was sent. You likely want to use the [pandas documentation on dates and times](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html) for this.

In [24]:
# Add the year column
part2_data['year'] = part2_data['created_utc'].apply(lambda a: pd.Timestamp.fromtimestamp(a).year)
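The ```.apply``` call above converts one timestamp at a time and interprets it in the local time zone of the machine running the code. A vectorized alternative is sketched below, assuming that the year should be taken in UTC (which matches the name ```created_utc```). Posts created within a few hours of New Year may land in a different year than with the local-time version, so counts can differ slightly between the two approaches. The timestamps here are made up for illustration.

```python
import pandas as pd

# Toy timestamps: start of 2010, last second of 2010, start of 2020 (all UTC)
toy = pd.DataFrame({"created_utc": [1262304000, 1293839999, 1577836800]})

# Convert the whole column at once; unit="s" means seconds since the epoch
toy["year"] = pd.to_datetime(toy["created_utc"], unit="s", utc=True).dt.year
print(toy)
```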

As a check on this column, answer the following question:
- **2.2.5** - How many posts in the dataset were sent in 2010?
              35

In [25]:
# Code for 2.2.5 here
part2_data[part2_data['year'].isin([2010])].year.count()

35

Now, we're going to plot the **yearly trend** of average upvotes for each subreddit.

**Hint: We will assume that the average upvotes for a given subreddit in a given year is zero when there are no top posts for that subreddit in that year.** To accurately reflect this, you will have to make sure to account for this case.


In [26]:
# Hint: you do not have to use this function, but it may be useful for you
from itertools import product

# create the zeros dataframe
merge_zeros = pd.DataFrame(product(part2_data.year.unique(), 
                     part2_data.subreddit_name_prefixed.unique()),
                     columns =['year','subreddit_name_prefixed']
                )
merge_zeros['ups_with_zeros'] = 0

# merge with the non-zero data ... you write this code ...
avg_ups = part2_data.groupby(['year', 'subreddit_name_prefixed'])
avg_ups_df = pd.DataFrame(avg_ups.ups.mean())
avg_ups_df_merged= avg_ups_df.merge(merge_zeros, on=['year','subreddit_name_prefixed'], how='right').drop(['ups_with_zeros'],axis=1)
avg_ups_df_merged['ups'] = avg_ups_df_merged['ups'].replace(np.nan, 0)
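For comparison, the same zero-filling idea can be expressed without a merge by reindexing the grouped means onto every (year, subreddit) combination. This is only an alternative sketch on made-up data, not a replacement for the approach above:

```python
import pandas as pd

toy = pd.DataFrame({
    "year": [2018, 2018, 2019, 2019],
    "subreddit_name_prefixed": ["r/a", "r/a", "r/a", "r/b"],
    "ups": [10, 30, 50, 70],
})

# Mean upvotes for the (year, subreddit) pairs that actually occur
means = toy.groupby(["year", "subreddit_name_prefixed"])["ups"].mean()

# Every possible (year, subreddit) pair, including ones with no posts
full_index = pd.MultiIndex.from_product(
    [sorted(toy["year"].unique()), sorted(toy["subreddit_name_prefixed"].unique())],
    names=["year", "subreddit_name_prefixed"],
)

# Pairs missing from the data get a mean of zero
filled = means.reindex(full_index, fill_value=0).reset_index()
print(filled)
```

Here ```r/b``` has no posts in 2018, so it receives a row with zero upvotes for that year.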

As a check, please do the following:

- **2.2.6** - In your report, provide a table (a screenshot of a pandas dataframe is fine) that shows the average number of upvotes for r/memes each year from 2015 to 2020. The table should be sorted by year (i.e. 2015, then 2016, etc.). Note again, if a year does not have data, there should be zeros in this table!

In [27]:
# Code for 2.2.6 here

avg_ups_df_merged[avg_ups_df_merged['subreddit_name_prefixed'].isin(['r/memes'])].query('year >= 2015 & year <= 2020').sort_values(by='year')


| | year | subreddit_name_prefixed | ups |
|---|---|---|---|
| 183 | 2015 | r/memes | 0.0 |
| 111 | 2016 | r/memes | 0.0 |
| 39 | 2017 | r/memes | 0.0 |
| 87 | 2018 | r/memes | 131206.0 |
| 63 | 2019 | r/memes | 135859.126984 |
| 15 | 2020 | r/memes | 141141.427305 |


- **2.2.7** - Plot a line graph of the temporal trend of mean upvotes from 2016-2020 for the following subreddits: r/Jokes, r/food, r/conspiracy, and r/news. You can plot them individually, or use the faceting approach from above. Write your code for this in the cell below; copy the resulting plot to your PDF report. **Hint: Doing part 2.2.8 will be easiest if you make sure that the plot for each subreddit has its own y-axis!**
- **2.2.8** - Using what you have plotted, make an argument for which of the four subreddits is the most "up and coming" - i.e. the one that seems to be getting more popular over time. NOTE: There is more than one reasonable answer here. We are looking for how you justify your answer using the (plotted) data.

      The subreddit r/news is the most up and coming of the four given subreddits. The subreddits r/food and r/conspiracy display a rather flat change in the number of upvotes over the years. On the contrary, the subreddit r/Jokes displays a downward trend of upvotes between the years 2016 and 2020. The subreddit r/news shows a gradual yet consistent increase in the number of upvotes during the years observed and can be deemed the most up and coming subreddit.

In [28]:
# Code for 2.2.7 here
subreddit_filter = avg_ups_df_merged[avg_ups_df_merged['subreddit_name_prefixed'].isin(['r/Jokes', 'r/food', 'r/conspiracy', 'r/news'])]
year_filter = subreddit_filter[subreddit_filter['year'].isin([2016,2017,2018,2019,2020])].sort_values(by='year')

fig = px.line(year_filter, x = 'year', y = 'ups', facet_col = 'subreddit_name_prefixed', facet_col_wrap = 2, facet_col_spacing = 0.08,
              width = 1080, height = 720)
fig.for_each_annotation(lambda a: a.update(text=a.text.split("=")[1]))
fig.show()

## Part 2.3 - Data Cleaning and some final regression-oriented data exploration

With the above analysis, we've learned some things about what predicts upvotes:
- Which subreddit the post is in seems to matter quite a bit for the number of upvotes
- Time: there are temporal trends, although separate for each subreddit, that seem to be predictive

As we gear up to create our linear regression model to try to predict the number of upvotes for posts, we are going to turn to two last steps:
1. Data cleaning - we're going to take a look at some bivariate statistics, which are going to reveal some columns in our data that are not useful.  We'll then remove them.
2. Looking at univariate relationships with our outcome - we are going to plot relationships between a few of the remaining interesting continuous variables and our outcome of interest (upvotes)


### Cleaning our data

Below, we list the columns of our dataset...

In [29]:
part2_data.columns

Index(['created_utc', 'is_crosspostable', 'is_self', 'is_video', 'locked',
       'media_only', 'over_18', 'score', 'subreddit_id',
       'subreddit_name_prefixed', 'subreddit_subscribers', 'title',
       'permalink', 'total_awards_received', 'downs', 'gilded', 'num_comments',
       'num_crossposts', 'num_reports', 'ups', 'author_name', 'year'],
      dtype='object')

Let's start by looking at the continuous variables. Those are:
- ```total_awards_received```
- ```downs```
- ```gilded```
- ```num_comments```
- ```num_crossposts```
- ```num_reports```
- ```created_utc```
- ```subreddit_subscribers```

- **2.3.1**-  There are two continuous variables that are very clearly not going to be useful for our analysis. Identify them, and explain why they are not useful (**note: you do NOT need to know why these variables take on the values they do in our data. You just need to know why we don't want to use them!**)

      The two continuous variables that are not useful for our analysis, for our particular data, are 'downs' and 'num_reports'. This is because the value of downs is 0 for every case in the data and the value of num_reports is NaN throughout the data.

Let's now look at our (supposedly) binary categorical variables:
- ```is_crosspostable```
- ```is_self```
- ```media_only```
- ```is_video```
- ```locked```
- ```over_18```

- **2.3.2**-  There are two (supposedly) binary variables that are very clearly not going to be useful for our analysis. Identify them, and explain why they are not useful (**note: you do NOT need to know why these variables take on the values they do in our data. You just need to know why we don't want to use them!**)

      The two binary categorical variables that are very clearly not going to be useful are 'is_crosspostable' and 'media_only'. Though they are binary in nature, for all of the cases present in our data they have the same value. Both is_crosspostable and media_only have the value False throughout the data.
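A quick way to spot columns like the ones in 2.3.1 and 2.3.2 is to count the distinct values in each column, treating missing values as a value of their own. The sketch below uses made-up data that mimics the situation described in the answers (a column of all zeros, a column of all NaN, a column of all ```False```):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "downs": [0, 0, 0],
    "num_reports": [np.nan, np.nan, np.nan],
    "media_only": [False, False, False],
    "num_comments": [3, 10, 7],
})

# dropna=False counts NaN as a distinct value, so an all-NaN column has 1 distinct value
n_distinct = toy.nunique(dropna=False)

# Columns with a single distinct value carry no information for a model
constant_columns = n_distinct[n_distinct <= 1].index.tolist()
print(constant_columns)
```

A column that never varies cannot help explain why some posts get more upvotes than others, which is the reasoning behind dropping these variables.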

Finally, let's look at our remaining variables, which are categorical. One of these, ```title``` (the post's title), is potentially a *very* useful feature... but we haven't yet learned how to use it. So, for now, we're not going to.  The other categorical features are:
- ```subreddit_id```
- ```subreddit_name_prefixed```
- ```permalink```

- **2.3.3** -  Explain why it is not useful to use *both* ```subreddit_id``` and ```subreddit_name_prefixed``` in any predictive analysis of per-post upvotes.

      The subreddit_id will be unique for each subreddit and has a one-to-one mapping with the subreddit_name_prefixed. Hence, the two columns carry the same information, and we can use either the id of the subreddit or its name in our analysis.

- **2.3.4** - Explain why it is not useful to use ```permalink``` in any predictive analysis of per-post upvotes.

      The permalink is just the link to the submission itself. It is different for every post, so it cannot tell us anything general about the number of upvotes per post.


## Univariate relationships with the outcome

- **2.3.5** - Plot the relationship between ```num_comments``` and upvotes as a scatterplot with log-scaled axes, with the posts from different subreddits as different color points. Paste this plot into your PDF writeup

- **2.3.6** - Describe, briefly (a sentence) the relationship between ```num_comments``` and upvotes.

        The number of comments and the number of upvotes are positively correlated, apart from some outliers: posts with more comments tend to have more upvotes.


In [30]:
# Code for 2.3.5 here
fig = px.scatter(part2_data, x='num_comments',y='ups',color='subreddit_name_prefixed', log_x=True, log_y=True)
fig.show()

Compute the [Pearson correlation](https://pandas.pydata.org/docs/reference/api/pandas.Series.corr.html#pandas.Series.corr) between ```ups``` and all other continuous variables (minus those you identified as not interesting in 2.3.1).

- **2.3.7** - Which of these has the strongest positive correlation with ```ups```?

        num_crossposts has the strongest positive correlation with ups
- **2.3.8** - Which of these has the weakest positive correlation with ```ups```?

        created_utc has the weakest positive correlation with ups.

In [31]:
# Code for 2.3.7-8 here
for cv in ['total_awards_received', 'gilded', 'num_comments', 'num_crossposts', 'created_utc', 'subreddit_subscribers']:
    print('Correlation between ups and '+ cv + ' : ' + str(part2_data['ups'].corr(part2_data[cv], method = 'pearson')))

Correlation between ups and total_awards_received : 0.3881542917691692
Correlation between ups and gilded : 0.22810273946438525
Correlation between ups and num_comments : 0.3306995383954572
Correlation between ups and num_crossposts : 0.5379816522109334
Correlation between ups and created_utc : 0.16547147438976492
Correlation between ups and subreddit_subscribers : 0.4102478018285364
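
The loop above can also be written as a single vectorized call. The following is a minimal sketch on synthetic data (the column names mirror the assignment, but the values are randomly generated), using `DataFrame.corrwith`, which computes the Pearson correlation of each column with a given Series by default.

```python
import numpy as np
import pandas as pd

# Toy data; the column names mirror the assignment but the values are synthetic.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "ups": rng.poisson(50, size=200),
    "num_comments": rng.poisson(10, size=200),
    "gilded": rng.poisson(1, size=200),
})

# One vectorized call: Pearson correlation of each listed column with `ups`.
corr_vectorized = toy[["num_comments", "gilded"]].corrwith(toy["ups"])

# The loop-based equivalent, for comparison.
corr_loop = pd.Series({c: toy["ups"].corr(toy[c]) for c in ["num_comments", "gilded"]})

# Sorting makes the strongest and weakest correlations easy to read off.
print(corr_vectorized.sort_values(ascending=False))
```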


# Part 3 - Linear Regression

OK! We've got a decent handle on our data, and we're ready to do some learning. 

We're going to use a linear regression model to predict the number of upvotes.

## Part 3.1 - Regression to predict ```ups```

You will need to write code that does the following:

1. Recreates (if you did not already store it in your dataset) a variable for the year a post was created, stored in a column named ```year```. Then, subsets your data to only posts from 2015-2021 (inclusive).

2. Creates a feature matrix, ```X```, that contains features for the following variables:
- ```total_awards_received```
- ```gilded```
- ```num_comments```
- ```num_crossposts```
- ```year```
- ```is_self```
- ```is_video```
- ```locked```
- ```over_18```
- ```subreddit_name_prefixed```

3. Creates an outcome variable, ```y```, that is **the logarithm of** ```ups +1```.
4. Splits the data into train and test (80% training, 20% testing) using the relevant ```sklearn``` function. **We have written this line of code for you below, please do not change the random state!**
5. Trains a linear regression model on the training data
6. Evaluates the model you have trained on the test set, using ```RMSE``` as an error metric. **You should calculate this error using ONLY ```pandas``` and/or ```numpy```, not ```sklearn```.**
7. Prints the error

A few useful hints:
- You cannot use ```subreddit_name_prefixed``` as is, you have to transform it somehow. We have suggested a tool to do so below (the ```OneHotEncoder```)
- You also need to transform any boolean variables to 0/1 encodings
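
As a small worked example of the error metric in step 6, RMSE can be computed with ```numpy``` alone. The values below are made up purely to make the arithmetic easy to follow.

```python
import numpy as np

# Made-up log-scale targets and predictions, purely for illustration.
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.0, 2.0, 5.0])

# RMSE = sqrt(mean((y_true - y_pred)^2))
squared_errors = (y_true - y_pred) ** 2   # [0, 0, 4]
rmse = np.sqrt(squared_errors.mean())     # sqrt(4 / 3)
print(rmse)
```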


In [36]:
# Lets just reload the data in to make sure we're all starting fresh!
part3_data = pd.read_csv("part2_data.csv")

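# Extract the year from the UTC timestamp (seconds since the epoch) in one vectorized call
part3_data['year'] = pd.to_datetime(part3_data['created_utc'], unit='s').dt.year
# Keep posts from 2015-2021 (inclusive); .copy() avoids SettingWithCopyWarning on later column assignments
part3_data = part3_data[part3_data['year'].between(2015, 2021)].copy()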

In [37]:
## NOTE: Typically we would not rescale a time variable, but it's fine for this assignment.
CONTINUOUS_VARS = ["total_awards_received", "gilded", "num_comments", "num_crossposts","year"]
BINARY_VARS = ["is_self", "is_video", "locked", "over_18"]

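for var in BINARY_VARS:
    # Write your code here to make sure the boolean variables are formatted as integers, as is required by sklearn
    # Index with [var]; `part3_data.var = ...` would set an attribute named "var" instead of updating the column
    part3_data[var] = part3_data[var].astype(int)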
    

In [38]:
from sklearn.preprocessing import OneHotEncoder

def onehot_encode_var(data, varname):
    # This function should take in a variable name in part3_data and return a onehot encoded matrix for that variable
    
    # Here's a starting point!
    # The one-hot encoder creates one binary indicator column per category in the given column
    # The first category is dropped, since it is implied when all other indicators are 0
    encoder = OneHotEncoder(drop='first')

    # Use the encoder
    onehot_encoded_variable = encoder.fit_transform(data[varname].values.reshape(-1, 1)).toarray()
    
    # return the onehot encoded variable
    return onehot_encoded_variable, encoder.categories_

In [39]:
# OK, now we're going to write our code to run the model!
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# First, rescale the continuous variables
# Standard scaling is applied which converts the data into a distribution with zero mean and a standard deviation of 1
continuous_rescaled_X = StandardScaler().fit_transform(part3_data[CONTINUOUS_VARS].values)

# Now, we can use our function above to get the onehotencoding for the subreddits ... go ahead!
name_encoded_X, _ = onehot_encode_var(part3_data, 'subreddit_name_prefixed')

# Now you can combine all of your features into a single feature matrix. Call it X
binary_X = part3_data.loc[:, BINARY_VARS]

# Building the feature matrix with all the independent variables
X = np.concatenate((continuous_rescaled_X, binary_X), 1)
X = np.concatenate((X, name_encoded_X), 1)

# And create your outcome variable, call it y
y = np.log(part3_data.loc[:, 'ups'].values + 1)

# Don't change this line!
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# fit a linear regression model, with an intercept
# Linear regression model, by default intercept is included
lmodel = LinearRegression()
lmodel.fit(X_train, y_train)

y_test_predicted = lmodel.predict(X_test)

# Compute RMSE = sqrt(mean((true values - predicted values)^2))
rmse = np.sqrt((np.square(y_test - y_test_predicted)).mean())
print('RMSE on testing data {:.2}'.format(rmse))


RMSE on testing data 0.32


### Questions to check understanding

- **3.1.1** - Report your error on the test data, in RMSE. State what this metric means for the expected error in terms of the number of upvotes (not log upvotes!) you should expect to be off on any given prediction

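      The root mean square error for the regression is 0.32, measured in units of log(ups + 1). It summarizes how far the residuals (the distances of the data points from the best-fit line) are spread around that line; the lower the RMSE, the closer the data points are to it. Because the model predicts on the log scale, this is a multiplicative error on the original scale: e^0.32 is about 1.38, so a typical prediction of ups + 1 is off by a factor of about 1.38, i.e. roughly 38% too high or 27% too low. The expected error in raw upvotes therefore grows with the post's true upvote count rather than being one fixed number.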
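
To make the link between log-scale RMSE and raw upvotes concrete, the sketch below back-transforms the reported 0.32. The two true upvote counts are hypothetical and were chosen only to show how the same log-scale error translates into different absolute errors.

```python
import math

rmse_log = 0.32  # test RMSE reported above, in units of log(ups + 1)

# A log-scale error of +/- rmse_log is a multiplicative error on (ups + 1).
factor_over = math.exp(rmse_log)    # about 1.38: predicting roughly 38% too high
factor_under = math.exp(-rmse_log)  # about 0.73: predicting roughly 27% too low

# Hypothetical posts with different true upvote counts.
for true_ups in (100, 10_000):
    off_by = (true_ups + 1) * (factor_over - 1)
    print(true_ups, round(off_by))
```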

Also, a few questions to target your understanding of how we set up the model:
- **3.1.2** - What did the whole one-hot encoding thing on ```subreddit_name_prefixed``` actually do? 

      Most machine learning models require numeric inputs, so categorical data has to be transformed before it can be used to fit a model. One-hot encoding converts a categorical variable into a set of binary indicator variables, one per category: each post gets a 1 in the column for its subreddit and a 0 in all the others. This works best for categorical variables where no ordinal relationship exists between the categories.
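
The idea can be sketched without ```sklearn``` by comparing each label against the list of categories. This is only an illustration with made-up subreddit labels, not a replacement for the ```OneHotEncoder``` used in the assignment.

```python
import numpy as np

# Made-up subreddit labels for five posts.
labels = np.array(["r/a", "r/b", "r/a", "r/c", "r/b"])
categories = np.unique(labels)  # sorted unique values: r/a, r/b, r/c

# One indicator column per category: 1 where the post belongs to it, else 0.
onehot = (labels[:, None] == categories[None, :]).astype(int)
print(onehot)
```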

- **3.1.3** - What does the argument ```drop = "first"``` do for us when we are doing that to ```subreddit_name_prefixed```?

      It drops the first category of the feature (here, the first subreddit name), so that k categories are encoded with k - 1 columns. The dropped column is redundant: it is 1 exactly when all the other columns are 0, and with an intercept in the model, keeping all k columns would make the features perfectly collinear. For example, consider a Gender categorical variable with categories Male and Female. The full one-hot encoding for this variable is as follows:
      Male_enc  Female_enc
      0         1
      1         0
      Here we know that if Male_enc is 0 the data point must be Female, and vice versa. Hence, dropping the first column loses no information and removes the redundancy.
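
The redundancy can be checked numerically: with an intercept column, the full set of indicators is rank deficient, while the set with one indicator dropped is not. The labels below are made up, and the indicator matrix is built by hand rather than with the ```OneHotEncoder```.

```python
import numpy as np

# Made-up subreddit labels for six posts.
labels = np.array(["r/a", "r/b", "r/a", "r/c", "r/b", "r/c"])
categories = np.unique(labels)
onehot = (labels[:, None] == categories[None, :]).astype(float)
intercept = np.ones((len(labels), 1))

full = np.hstack([intercept, onehot])            # intercept + all 3 indicators
dropped = np.hstack([intercept, onehot[:, 1:]])  # intercept + 2 indicators

# The indicators sum to the intercept column, so `full` has a redundant column.
rank_full = np.linalg.matrix_rank(full)
rank_dropped = np.linalg.matrix_rank(dropped)
print(rank_full, full.shape[1], rank_dropped, dropped.shape[1])
```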

    
- **3.1.3** - Why did we need to add one to the outcome variable before using ```log```?

      Adding one to the outcome variable before using log is necessary because a post can have zero upvotes, and log(0) is undefined: np.log(0) returns -inf (with a RuntimeWarning), which would break the regression. Adding 1 before taking the log maps zero upvotes to 0 and keeps every value finite.
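
A short sketch of the transform and its inverse, using made-up upvote counts. ```np.log1p``` and ```np.expm1``` are numpy's built-in versions of log(x + 1) and exp(x) - 1.

```python
import numpy as np

ups = np.array([0, 9, 99])  # made-up upvote counts, including a zero

y = np.log1p(ups)        # same as np.log(ups + 1); log1p(0) is exactly 0
recovered = np.expm1(y)  # inverse transform back to upvote counts
print(y, recovered)
```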
    

- **3.1.4** - What does the ```StandardScaler``` do? Why do we want to do that?

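      StandardScaler subtracts each feature's mean and divides by its standard deviation, so every rescaled feature has zero mean and a standard deviation of 1. This brings all the values onto the same unitless scale, which makes the regression coefficients directly comparable: each one is the change in the outcome for a one standard deviation change in the feature. For ordinary least squares with an intercept, rescaling does not change the predictions themselves; it helps with interpretation and matters for models that are sensitive to feature scale.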
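
The same rescaling can be written out by hand to see what it does. This is a sketch on two made-up features; it uses the population standard deviation (```ddof=0```), which to my knowledge is also what ```StandardScaler``` uses.

```python
import numpy as np

# Two made-up features on very different scales.
X = np.array([[1.0, 1000.0],
              [2.0, 3000.0],
              [3.0, 5000.0]])

# z = (x - column mean) / column standard deviation (population std, ddof=0)
Z = (X - X.mean(axis=0)) / X.std(axis=0)
print(Z.mean(axis=0), Z.std(axis=0))
```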


OK. Having looked at our RMSE, we should now realize that we have to be careful about assuming that this one statistic is actually a good estimate of how far we're going to be off on any prediction, selected at random. To see this, let's do the following:
- **3.1.5** - Provide a scatterplot that compares the true values in ```y_test``` to the absolute value of the difference between ```y_test``` and your predictions. **The axes should be on the original scale** (i.e. not the log scale you're predicting on).
- **3.1.6** - What does this plot suggest about how well your model fits the data as the true number of upvotes changes? 

        As the number of upvotes changes, the difference between the predicted and true values is relatively small, as we can see in the scatter plot. This suggests that the model fits the data well.


In [43]:
# Code for 3.1.5 here
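# y is log(ups + 1), so np.expm1 maps values back to the original upvote scale
true_ups = np.expm1(y_test)
abs_diff = np.abs(true_ups - np.expm1(y_test_predicted))
fig = px.scatter(x=true_ups, y=abs_diff, labels={'x': 'True upvotes', 'y': 'Absolute error (upvotes)'})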
fig.show()
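
To see why a single log-scale RMSE can hide very different errors on the original scale, consider the hypothetical case where every prediction misses by the same amount on the log scale. The true counts and the size of the miss below are made up for illustration.

```python
import numpy as np

true_ups = np.array([10.0, 1_000.0, 100_000.0])  # hypothetical true counts
log_error = 0.3                                  # same log-scale miss for each

y_true = np.log1p(true_ups)
y_pred = y_true + log_error  # every prediction is 0.3 too high on the log scale

# Back on the original scale, the same log error gives very different misses.
abs_error = np.abs(np.expm1(y_pred) - np.expm1(y_true))
print(abs_error)
```

The absolute error here is proportional to the true count plus one, so posts with more upvotes get larger errors in raw upvotes even though the log-scale error is identical.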


One final thing we are going to play with in 3.1. Logging the dependent variable is useful for a few reasons we have or will discuss in class (depending on when you're reading this). But it's also sometimes useful to log *independent* variables as well. Below, redo the same analysis as above, but after logging the non-temporal continuous variables (i.e. all the continuous variables except ```created_utc```). Use these as predictors instead of the original continuous variables. **Note: Perform the logging before you rescale the variables. Also, you should add 1 as we did for the dependent variable above**.

- **3.1.7** - What is the new RMSE with the logged independent variables?

      The new RMSE with the logged independent variables is 0.303.  
- **3.1.8** - How did this compare to the old RMSE? Why do you think that is? Hint: It may help to re-plot the same figure as you did in 3.1.5, but with the new model, in order to answer this question.

      The new RMSE (0.30) is lower than the old RMSE (0.32). A logarithmic transformation is a convenient way to turn highly skewed variables into more symmetric, closer-to-normal ones. Count variables such as num_comments are heavily right-skewed, and after logging them their relationship with log(ups + 1) is closer to linear, which is what a linear regression can capture. Logging the independent variables therefore improved the model fit, reducing the RMSE from 0.32 to 0.30.


In [45]:
# Code for 3.1.7 here
# Removing 'year' since we don't have to apply log transformation to it.
CONTINUOUS_VARS_WITHOUT_CREATED_UTC = ["total_awards_received", "gilded", "num_comments", "num_crossposts"]
continuous_variables_X = np.log(part3_data.loc[:, CONTINUOUS_VARS_WITHOUT_CREATED_UTC].values + 1)

# Adding 'year' after log transformation is applied to other continuous variables
continuous_variables_X = np.concatenate((continuous_variables_X, part3_data.loc[:, ['year']].values), axis=1)

# First, rescale the continuous variables
continuous_rescaled_X = StandardScaler().fit_transform(continuous_variables_X)

# Now, we can use our function above to get the onehotencoding for the subreddits ... go ahead!
name_encoded_X, categories_ = onehot_encode_var(part3_data, 'subreddit_name_prefixed')

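# Logging the one-hot columns maps 0 -> 0 and 1 -> log(2); this only rescales them,
# so the linear model's predictions are unchanged (the coefficients are scaled by 1 / log(2))
name_encoded_X = np.log(name_encoded_X + 1)

# Now you can combine all of your features into a single feature matrix. Call it X
binary_X = part3_data.loc[:, BINARY_VARS].values

# Logging the binary values likewise only rescales them (0 -> 0, 1 -> log(2))
binary_X = np.log(binary_X + 1)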

X = np.concatenate((continuous_rescaled_X, binary_X), axis=1)
X = np.concatenate((X, name_encoded_X), axis=1)

feature_X = np.concatenate((CONTINUOUS_VARS, BINARY_VARS), axis=None)
feature_X = np.concatenate((feature_X, np.array(categories_[0])[1:]), axis=None)

# And create your outcome variable, call it y
y = np.log(part3_data.loc[:, 'ups'].values + 1)

# Don't change this line!
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# fit a linear regression model, with an intercept
# Linear regression model
lmodel = LinearRegression()
lmodel.fit(X_train, y_train)

y_test_predicted = lmodel.predict(X_test)

# Compute RMSE
rmse = np.sqrt((np.square(y_test - y_test_predicted)).mean())
print('RMSE on testing data {:.3}'.format(rmse))

RMSE on testing data 0.303
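
Note that the cell above also applies the log to the 0/1 columns. For a linear model with an intercept, this should not affect the fit, because log(x + 1) maps 0 to 0 and 1 to log(2), which is just a rescaling. The sketch below checks this on a synthetic 0/1 feature with a plain least-squares fit; the data and the helper ```fit_predict``` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
dummy = rng.integers(0, 2, size=n).astype(float)  # synthetic 0/1 feature
y = 2.0 + 1.5 * dummy + rng.normal(scale=0.1, size=n)

def fit_predict(feature):
    """Ordinary least squares with an intercept; returns (coefficients, predictions)."""
    A = np.column_stack([np.ones(n), feature])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef, A @ coef

coef_raw, pred_raw = fit_predict(dummy)
coef_log, pred_log = fit_predict(np.log(dummy + 1))  # 0 stays 0, 1 becomes log(2)

# The fitted values agree; only the slope changes, by a factor of 1 / log(2).
print(np.allclose(pred_raw, pred_log))
print(coef_log[1] / coef_raw[1], 1 / np.log(2))
```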


In [46]:
# code for 3.1.8
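# y is log(ups + 1), so np.expm1 maps values back to the original upvote scale
true_ups = np.expm1(y_test)
abs_diff = np.abs(true_ups - np.expm1(y_test_predicted))
fig = px.scatter(x=true_ups, y=abs_diff, labels={'x': 'True upvotes', 'y': 'Absolute error (upvotes)'})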
fig.show()


## Part 3.2 - Exploration of regression coefficients

Now, let's look at the effects of our variables for this last model (with the logarithms of the independent variables). Carefully re-combine your features with their labels (*hint, ```encoder.categories_``` will be your friend, and remember, we dropped the first category!*)

- **3.2.1** - What is the strongest positive predictor of upvotes? How many more log(upvotes+1) does a one standard deviation increase in the feature correspond to?

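      'r/memes', with a coefficient of 1.14274892: log(upvotes + 1) is about 1.14 higher per unit increase in this feature. Note that the one-hot subreddit columns were not standardized, so this is a per-unit effect relative to the dropped reference subreddit rather than a per-standard-deviation effect.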

- **3.2.2** - What is the strongest negative predictor of upvotes? How many fewer log(upvotes+1) does a one standard deviation increase in the feature correspond to?

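      'r/learnprogramming', with a coefficient of -4.1386755: log(upvotes + 1) is about 4.14 lower per unit increase in this feature. As above, the one-hot subreddit columns were not standardized, so this is a per-unit effect relative to the dropped reference subreddit rather than a per-standard-deviation effect.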

In [47]:
# Add your code for 3.2 here
coefs = pd.DataFrame(
    lmodel.coef_,
    columns=["Coefficients"],
    index=feature_X,
)

coefs

| | Coefficients |
|---|---|
| total_awards_received | 0.087056 |
| gilded | 0.03223 |
| num_comments | 0.166539 |
| num_crossposts | 0.097167 |
| year | -0.076029 |
| is_self | -0.270867 |
| is_video | -0.119884 |
| locked | 0.057639 |
| over_18 | 0.013042 |
| r/PoliticalHumor | -0.035165 |
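
Rather than scanning the table by eye, the largest and smallest coefficients can be pulled out directly. The sketch below uses only the ten rows displayed above, so its result differs from the answers to 3.2.1 and 3.2.2, which refer to subreddit rows that are not shown in this output.

```python
import pandas as pd

# Only the rows displayed above; the full table has more subreddit rows.
coefs = pd.Series({
    "total_awards_received": 0.087056,
    "gilded": 0.03223,
    "num_comments": 0.166539,
    "num_crossposts": 0.097167,
    "year": -0.076029,
    "is_self": -0.270867,
    "is_video": -0.119884,
    "locked": 0.057639,
    "over_18": 0.013042,
    "r/PoliticalHumor": -0.035165,
})

strongest_positive = coefs.idxmax()  # label of the largest coefficient
strongest_negative = coefs.idxmin()  # label of the smallest coefficient
print(coefs.sort_values(ascending=False))
```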


## Part 3.3 - 574 Only - Attempting to Improve Your Predictions

In class, we talked about a few things we might do to improve our model's predictions. These include adding interaction terms, including different functional forms of a feature, using a different model, etc. Here, we ask that you implement at least two of these, and then re-evaluate your model. We'll ask some of the teams with some of the more interesting/effective ideas here to come present their solutions to the class!

- **3.3.1** - Describe at least two changes you made -- at least one to the feature set, and at least one different model -- to try to improve prediction.  Explain *why* you think that these changes make sense, given the Exploratory analyses above, or any other exploratory analysis you choose to do.

        For the feature set, we added the variable subreddit_subscribers, since it has one of the highest correlations with the number of upvotes (0.41, second only to num_crossposts).

        The regression model was changed from linear regression to Support Vector Regression (SVR). SVR fits a function with a tube of fixed width around it, and errors that fall inside this tube do not count. The fitted model therefore depends only on a subset of the training data (the support vectors), because the cost function ignores the training points that lie within the margin. Errors outside the tube are penalized linearly rather than quadratically, which makes the fit less sensitive to the outliers we noticed in num_comments and num_crossposts. With the RBF kernel, SVR can also capture non-linear relationships, and it handles our mix of rescaled continuous features and encoded categorical features.


- **3.3.2** - By how much did your RMSE improve? Which change that you made improved it the most? How do you know?

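        The RMSE improved from 0.32 to 0.27. Adding subreddit_subscribers to the linear model with logged features (RMSE 0.303) first brought the RMSE down to 0.29, and changing the model to SVR lowered it further to 0.27. Relative to that logged-feature model, changing the model to SVR improved the RMSE the most. We know this because we made the changes one at a time and recorded the RMSE after each.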

In [53]:
# Add your code for 3.3.1 here
from sklearn import svm

CONTINUOUS_VARS_WITHOUT_CREATED_UTC = ["total_awards_received", "gilded", "num_comments", "num_crossposts", "subreddit_subscribers"]
continuous_variables_X = np.log(part3_data.loc[:, CONTINUOUS_VARS_WITHOUT_CREATED_UTC].values + 1)
continuous_variables_X = np.concatenate((continuous_variables_X, part3_data.loc[:, ['year']].values), axis=1)

# First, rescale the continuous variables
continuous_rescaled_X = StandardScaler().fit_transform(continuous_variables_X)

# Now, we can use our function above to get the onehotencoding for the subreddits ... go ahead!
name_encoded_X, categories_ = onehot_encode_var(part3_data, 'subreddit_name_prefixed')

# Logging the one-hot columns maps 0 -> 0 and 1 -> log(2), i.e. it only rescales them
name_encoded_X = np.log(name_encoded_X + 1)

# Now you can combine all of your features into a single feature matrix. Call it X
binary_X = part3_data.loc[:, BINARY_VARS].values

# Logging the binary values likewise only rescales them (0 -> 0, 1 -> log(2))
binary_X = np.log(binary_X + 1)

X = np.concatenate((continuous_rescaled_X, binary_X), axis=1)
X = np.concatenate((X, name_encoded_X), axis=1)

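# Feature names in the same order as the columns of X (the first subreddit category was dropped)
feature_X = np.concatenate((CONTINUOUS_VARS_WITHOUT_CREATED_UTC, ['year'], BINARY_VARS), axis=None)
feature_X = np.concatenate((feature_X, np.array(categories_[0])[1:]), axis=None)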

# And create your outcome variable, call it y
y = np.log(part3_data.loc[:, 'ups'].values + 1)

# Don't change this line!
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Using support vector regression (SVR) with the default RBF kernel
lmodel = svm.SVR()
lmodel.fit(X_train, y_train)
y_test_predicted = lmodel.predict(X_test)

rmse = np.sqrt((np.square(y_test - y_test_predicted)).mean())
print('RMSE on testing data {:.2}'.format(rmse))


RMSE on testing data 0.27
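
The "errors inside the threshold don't matter" idea behind SVR corresponds to the epsilon-insensitive loss. The sketch below computes that loss by hand on made-up residuals and compares it with squared error; it illustrates the loss function only and is not how ```svm.SVR``` is fitted internally. The value 0.1 is scikit-learn's default ```epsilon``` for ```SVR```.

```python
import numpy as np

epsilon = 0.1  # scikit-learn's default epsilon for SVR
residuals = np.array([-0.5, -0.05, 0.0, 0.08, 0.3])  # made-up y_true - y_pred

# Errors inside the tube [-epsilon, epsilon] cost nothing;
# outside it, the cost grows linearly with the distance past the tube.
eps_insensitive = np.maximum(0.0, np.abs(residuals) - epsilon)
squared = residuals ** 2  # what least squares would charge, for comparison
print(eps_insensitive)
print(squared)
```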
