# Data Collection
---

For the purpose of this project, data will be collected from two subreddits via the use of reddit API. The two subreddits that have been selected are English Premier League and Champions League. The API request will be run multiple times so as to hit the target of 700-1000 distinct posts for each subreddit. The multiple files for each subreddit will then be combined and duplicates removed. The codes and the steps taken for data collection are specified below. 

### Contents:

- [1. Import Libraries](#1.-Import-Libraries)
- [2. English Premier League](#2.-English-Premier-League)
- [3. Champions League](#3.-Champions-League)

## 1. Import Libraries
---

In [1]:
import requests
import pandas as pd
import time
import random

## 2. English Premier League
---

In [132]:
# create a function to pull the data from reddit
def pull_data(url, headers):
    posts = []
    after = None

    while len(posts) <= 1000:
        try:
            if after == None:
                current_url = url
            else:
                current_url = url + '?after=' + after
            print(current_url)
            res = requests.get(current_url, headers=headers)
    
            if res.status_code != 200:
                print('Status error', res.status_code)
                break
    
            current_dict = res.json()
            current_posts = [p['data'] for p in current_dict['data']['children']]
            posts.extend(current_posts)
            after = current_dict['data']['after']
    
            # generate a random sleep duration to look more 'natural'
            sleep_duration = random.randint(2,60)
            print(sleep_duration)
            time.sleep(sleep_duration)
    
        except TypeError:
            print("No more posts!")
            break
        
        except KeyboardInterrupt:
            print("Done!")
            break
    
    return posts

In [42]:
# 2nd run
url = "https://www.reddit.com/r/PremierLeague/hot.json"
headers = {"User-agent": "premier league hot posts"}

In [43]:
# 2nd run
epl_posts2 = pull_data(url, headers)

https://www.reddit.com/r/PremierLeague/hot.json
13
https://www.reddit.com/r/PremierLeague/hot.json?after=t3_o096xd
5
https://www.reddit.com/r/PremierLeague/hot.json?after=t3_ny65n5
24
https://www.reddit.com/r/PremierLeague/hot.json?after=t3_nvusj1
11
https://www.reddit.com/r/PremierLeague/hot.json?after=t3_nt94kz
42
https://www.reddit.com/r/PremierLeague/hot.json?after=t3_nr4csv
30
https://www.reddit.com/r/PremierLeague/hot.json?after=t3_nouqq6
39
https://www.reddit.com/r/PremierLeague/hot.json?after=t3_nnlahz
16
https://www.reddit.com/r/PremierLeague/hot.json?after=t3_nmeqbp
54
https://www.reddit.com/r/PremierLeague/hot.json?after=t3_nk7vne
59
https://www.reddit.com/r/PremierLeague/hot.json?after=t3_njd3qb
53
https://www.reddit.com/r/PremierLeague/hot.json?after=t3_niziv9
34
https://www.reddit.com/r/PremierLeague/hot.json?after=t3_ngygdp
38
https://www.reddit.com/r/PremierLeague/hot.json?after=t3_nf7fl9
46
https://www.reddit.com/r/PremierLeague/hot.json?after=t3_neqow5
20
https://www.

In [59]:
# 2nd run - save as dataframe
epl2 = pd.DataFrame(epl_posts2)

In [61]:
# 2nd run - save to csv
epl2.to_csv("../data/premier_league2.csv", index = False)

In [144]:
# 3rd run
url = "https://www.reddit.com/r/PremierLeague/top.json?t=year"
headers = {"User-agent": "premier league top posts"}

In [145]:
# 3rd run
epl_posts3 = pull_data(url, headers)

https://www.reddit.com/r/PremierLeague/top.json?t=year
40
https://www.reddit.com/r/PremierLeague/top.json?t=year?after=t3_nmxh9q
42
https://www.reddit.com/r/PremierLeague/top.json?t=year
11
Done!


In [146]:
# 3rd run - save to csv
pd.DataFrame(epl_posts3).to_csv("../data/premier_league3.csv", index = False)

In [199]:
# 4th run 
url = "https://www.reddit.com/r/PremierLeague/hot.json"
headers = {"User-agent": "premier league hot posts"}

In [200]:
# 4th run
epl_posts4 = pull_data(url, headers)

https://www.reddit.com/r/PremierLeague/hot.json
33
https://www.reddit.com/r/PremierLeague/hot.json?after=t3_nzysg4
22
https://www.reddit.com/r/PremierLeague/hot.json?after=t3_nyo7y0
59
https://www.reddit.com/r/PremierLeague/hot.json?after=t3_nwlww1
8
https://www.reddit.com/r/PremierLeague/hot.json?after=t3_nssokm
46
https://www.reddit.com/r/PremierLeague/hot.json?after=t3_nrayte
51
https://www.reddit.com/r/PremierLeague/hot.json?after=t3_npn6dy
7
https://www.reddit.com/r/PremierLeague/hot.json?after=t3_noffuy
7
https://www.reddit.com/r/PremierLeague/hot.json?after=t3_nm3tlr
19
https://www.reddit.com/r/PremierLeague/hot.json?after=t3_nk6q9w
24
https://www.reddit.com/r/PremierLeague/hot.json?after=t3_njd6iy
54
https://www.reddit.com/r/PremierLeague/hot.json?after=t3_nj8mc9
3
https://www.reddit.com/r/PremierLeague/hot.json?after=t3_nhavxq
10
https://www.reddit.com/r/PremierLeague/hot.json?after=t3_nfnh4k
57
https://www.reddit.com/r/PremierLeague/hot.json?after=t3_ndtfds
13
https://www.red

In [201]:
# 4th run - save to csv
pd.DataFrame(epl_posts4).to_csv("../data/premier_league4.csv", index = False)

In [249]:
# 5th run 
url = "https://www.reddit.com/r/PremierLeague/new.json"
headers = {"User-agent": "premier league top posts"}

In [250]:
epl_posts5 = pull_data(url, headers)

https://www.reddit.com/r/PremierLeague/new.json
8
https://www.reddit.com/r/PremierLeague/new.json?after=t3_o0y4ul
2
https://www.reddit.com/r/PremierLeague/new.json?after=t3_ny8eu6
6
https://www.reddit.com/r/PremierLeague/new.json?after=t3_nw6ydf
56
https://www.reddit.com/r/PremierLeague/new.json?after=t3_nt94kz
51
https://www.reddit.com/r/PremierLeague/new.json?after=t3_nr9414
3
https://www.reddit.com/r/PremierLeague/new.json?after=t3_np60pr
42
https://www.reddit.com/r/PremierLeague/new.json?after=t3_nnw79l
33
https://www.reddit.com/r/PremierLeague/new.json?after=t3_nm5c8t
46
https://www.reddit.com/r/PremierLeague/new.json?after=t3_nk6q9w
47
https://www.reddit.com/r/PremierLeague/new.json?after=t3_njewh0
38
https://www.reddit.com/r/PremierLeague/new.json?after=t3_nj510d
7
https://www.reddit.com/r/PremierLeague/new.json?after=t3_nh3umk
38
https://www.reddit.com/r/PremierLeague/new.json?after=t3_nfve5c
20
https://www.reddit.com/r/PremierLeague/new.json?after=t3_nf10tp
34
https://www.redd

In [251]:
# 5th run - save to csv
pd.DataFrame(epl_posts5).to_csv("../data/premier_league5.csv", index = False)

### Combine Files

In [272]:
# read the data files
data1 = pd.read_csv("../data/premier_league1.csv")
data2 = pd.read_csv("../data/premier_league2.csv")
data3 = pd.read_csv("../data/premier_league3.csv")
data4 = pd.read_csv("../data/premier_league4.csv")
data5 = pd.read_csv("../data/premier_league5.csv")
data6 = pd.read_csv("../data/premier_league_final.csv")

In [273]:
# combine data
data_combined = pd.concat([data1, data2, data3, data4, data5, data6], axis=0)

In [274]:
# replace na in selftext with empty string
data_combined["selftext"] = data_combined["selftext"].fillna("")

In [275]:
# display
data_combined.head()

Unnamed: 0.1,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,...,url_overridden_by_dest,preview,poll_data,crosspost_parent_list,crosspost_parent,media_metadata,is_gallery,gallery_data,author_cakeday,Unnamed: 0
0,,PremierLeague,What's on your mind? This is the daily discuss...,t2_6l4z3,False,,0,False,r/PremierLeague Daily Discussion,"[{'a': ':xpl:', 'e': 'emoji', 'u': 'https://em...",...,,,,,,,,,,
1,,PremierLeague,,t2_rk9wt,False,,0,False,The 2021/22 Premier League fixtures have been ...,"[{'a': ':xpl:', 'e': 'emoji', 'u': 'https://em...",...,https://www.premierleague.com/news/2171434,{'images': [{'source': {'url': 'https://extern...,,,,,,,,
2,,PremierLeague,,t2_rq268,False,,0,False,Ashley Young agrees personal terms with Aston ...,"[{'a': ':ava:', 'e': 'emoji', 'u': 'https://em...",...,https://www.skysports.com/football/news/11677/...,{'images': [{'source': {'url': 'https://extern...,,,,,,,,
3,,PremierLeague,Help: how to get tickets for next Derby (Satur...,t2_qotcg,False,,0,False,Old Trafford Tickets,"[{'e': 'text', 't': 'Question'}]",...,,{'images': [{'source': {'url': 'https://extern...,,,,,,,,
4,,PremierLeague,I have thought this for a while but IMO the De...,t2_39afif5m,False,,0,False,Goalkeepers are protected too much in football,"[{'e': 'text', 't': 'Discussion'}]",...,,,,,,,,,,


In [276]:
# remove rows that have the same selftext and title
df = data_combined.drop_duplicates(subset=["title", "selftext"], keep="first")

In [280]:
# dataframe dimension
df.shape

(943, 117)

In [278]:
# save combine file to csv
df.to_csv("../data/premier_league_combined.csv", index = False)

## 3. Champions League
---

In [26]:
url = "https://www.reddit.com/r/championsleague/hot.json"
headers = {"User-agent": "champions league hot posts"}

In [27]:
# 1st pull
cl_posts1 = pull_data(url, headers)

https://www.reddit.com/r/championsleague/hot.json
57
https://www.reddit.com/r/championsleague/hot.json?after=t3_nnwb9p
26
https://www.reddit.com/r/championsleague/hot.json?after=t3_nlswb4
20
https://www.reddit.com/r/championsleague/hot.json?after=t3_na66uh
22
https://www.reddit.com/r/championsleague/hot.json?after=t3_n4xv16
38
https://www.reddit.com/r/championsleague/hot.json?after=t3_n0d57y
13
https://www.reddit.com/r/championsleague/hot.json?after=t3_mu34yr
30
https://www.reddit.com/r/championsleague/hot.json?after=t3_mr9dgt
41
https://www.reddit.com/r/championsleague/hot.json?after=t3_mp1w3s
26
https://www.reddit.com/r/championsleague/hot.json?after=t3_mlzbo3
47
https://www.reddit.com/r/championsleague/hot.json?after=t3_mhcfuu
8
https://www.reddit.com/r/championsleague/hot.json?after=t3_m7dqwj
27
https://www.reddit.com/r/championsleague/hot.json?after=t3_m25vff
32
https://www.reddit.com/r/championsleague/hot.json?after=t3_ly9ihs
24
https://www.reddit.com/r/championsleague/hot.json?a

In [121]:
# 1st pull
cpl1 = pd.DataFrame(cl_posts1)
cpl1.to_csv("../data/champions_league1.csv", index = False)

In [281]:
# 2nd pull 
url = "https://www.reddit.com/r/championsleague/top.json?t=year"
headers = {"User-agent": "champions league top posts"}

In [282]:
# 2nd pull 
cl_posts2 = pull_data(url, headers)

https://www.reddit.com/r/championsleague/top.json?t=year
51
https://www.reddit.com/r/championsleague/top.json?t=year?after=t3_m1spa8
45
https://www.reddit.com/r/championsleague/top.json?t=year
26
Done!


In [283]:
pd.DataFrame(cl_posts2).to_csv("../data/champions_league2.csv", index = False)

In [177]:
# 3rd pull
url = "https://www.reddit.com/r/championsleague/hot.json"
headers = {"User-agent": "champions league hot posts"}

In [178]:
# 3rd pull
cl_posts3 = pull_data(url, headers)

https://www.reddit.com/r/championsleague/hot.json
56
https://www.reddit.com/r/championsleague/hot.json?after=t3_nnussh
42
https://www.reddit.com/r/championsleague/hot.json?after=t3_nlgm5w
35
https://www.reddit.com/r/championsleague/hot.json?after=t3_n9u7sw
28
https://www.reddit.com/r/championsleague/hot.json?after=t3_n53kpu
17
https://www.reddit.com/r/championsleague/hot.json?after=t3_mztft8
16
https://www.reddit.com/r/championsleague/hot.json?after=t3_mueya8
42
https://www.reddit.com/r/championsleague/hot.json?after=t3_mqle37
48
https://www.reddit.com/r/championsleague/hot.json?after=t3_moqk1s
41
https://www.reddit.com/r/championsleague/hot.json?after=t3_mljpst
53
https://www.reddit.com/r/championsleague/hot.json?after=t3_mgizd5
46
https://www.reddit.com/r/championsleague/hot.json?after=t3_m7bp30
23
https://www.reddit.com/r/championsleague/hot.json?after=t3_m14rwr
35
https://www.reddit.com/r/championsleague/hot.json?after=t3_lsayly
34
https://www.reddit.com/r/championsleague/hot.json?

In [179]:
# 3rd pull - save to csv
pd.DataFrame(cl_posts3).to_csv("../data/champions_league3.csv", index = False)

In [289]:
# 4th pull
url = "https://www.reddit.com/r/championsleague/new.json"
headers = {"User-agent": "champions league hot posts"}

In [290]:
# 4th pull
cl_posts4 = pull_data(url, headers)

https://www.reddit.com/r/championsleague/new.json
36
https://www.reddit.com/r/championsleague/new.json?after=t3_nnx3x8
37
https://www.reddit.com/r/championsleague/new.json?after=t3_nlw5mt
27
https://www.reddit.com/r/championsleague/new.json?after=t3_na66uh
44
https://www.reddit.com/r/championsleague/new.json?after=t3_n4xv16
30
https://www.reddit.com/r/championsleague/new.json?after=t3_n08r7b
51
https://www.reddit.com/r/championsleague/new.json?after=t3_mu7zm3
59
https://www.reddit.com/r/championsleague/new.json?after=t3_mr2wso
47
https://www.reddit.com/r/championsleague/new.json?after=t3_mp1w3s
31
https://www.reddit.com/r/championsleague/new.json?after=t3_mlu45u
12
https://www.reddit.com/r/championsleague/new.json?after=t3_mhcfuu
4
https://www.reddit.com/r/championsleague/new.json?after=t3_m786hf
51
https://www.reddit.com/r/championsleague/new.json?after=t3_m1spa8
30
https://www.reddit.com/r/championsleague/new.json?after=t3_lsh81n
7
https://www.reddit.com/r/championsleague/new.json?af

In [291]:
# 4th pull - save to csv
pd.DataFrame(cl_posts4).to_csv("../data/champions_league4.csv", index = False)

### Combine Files

In [299]:
# read the data files
cpl_1 = pd.read_csv("../data/champions_league1.csv")
cpl_2 = pd.read_csv("../data/champions_league2.csv")
cpl_3 = pd.read_csv("../data/champions_league3.csv")
cpl_4 = pd.read_csv("../data/champions_league4.csv")
cpl_5 = pd.read_csv("../data/champions_league.csv")

In [300]:
# combine data
cpl_combined = pd.concat([cpl_1, cpl_2, cpl_3, cpl_4, cpl_5], axis=0)

In [301]:
# replace na with empty string
cpl_combined["selftext"] = cpl_combined["selftext"].fillna("")

In [302]:
# remove rows that have the same selftext and title
cpl_df = cpl_combined.drop_duplicates(subset=["title", "selftext"], keep="first")

In [305]:
# dimension
cpl_df.shape

(992, 115)

In [304]:
# save combine file to csv
cpl_df.to_csv("../data/champions_league_combined.csv", index = False)