# Web Scraping
#### Credit to Aiden Curley for assistance

<a name="contents"></a>
- [Contents](#contents)  
    - [Imports](#imports)  
    - [API Request Functions](#API's)  
    - [Web Scraping](#scraping)  
    - [Create Dataframes](#dataframes)  
    - [Save to File](#save)

<a name="imports"></a>
- [Back to Contents](#contents)
## Imports

In [1]:
# imports:
import pandas as pd
import requests
import time

<a name="API's"></a>
- [Back to Contents](#contents)
## API Request Functions

In [2]:
# Get Submission 
# Credit to: https://pythonprogramming.altervista.org/collect-data-from-reddit/?doing_wp_cron=1597670992.0452320575714111328125
def submission_request_amd(**kwargs):
    # Send a GET request to the Pushshift API for submissions from the Amd subreddit;
    # kwargs are passed along as the query parameters
    request = requests.get("https://api.pushshift.io/reddit/search/submission/?subreddit=Amd", params=kwargs)
    # Parse the JSON body of the response
    data = request.json()
    # Return the list of posts stored under the 'data' key --> add [0] to get the first post
    return data['data']

In [3]:
def submission_request_build_pc(**kwargs):
    # Send a GET request to the Pushshift API for submissions from the buildapc subreddit;
    # kwargs are passed along as the query parameters
    request = requests.get("https://api.pushshift.io/reddit/search/submission/?subreddit=buildapc", params=kwargs)
    # Parse the JSON body of the response
    data = request.json()
    # Return the list of posts stored under the 'data' key --> add [0] to get the first post
    return data['data']
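
The two request functions above differ only in the subreddit name, so the query could be built by a single helper that takes the subreddit as an argument. The following is a minimal sketch using only the standard library so it runs without network access. The names `build_submission_url` and `PUSHSHIFT_SUBMISSION_URL` are hypothetical and not part of the original notebook. Dropping parameters whose value is `None` mirrors what `requests` does with `params`.

```python
from urllib.parse import urlencode

# Endpoint taken from the request functions above
PUSHSHIFT_SUBMISSION_URL = "https://api.pushshift.io/reddit/search/submission/"


def build_submission_url(subreddit, **params):
    """Build the query URL for one subreddit (hypothetical helper)."""
    query = {"subreddit": subreddit}
    # Skip parameters that are None, e.g. before=None on the first pull
    query.update({key: value for key, value in params.items() if value is not None})
    return PUSHSHIFT_SUBMISSION_URL + "?" + urlencode(query)


url = build_submission_url("Amd", before=None, sort="desc", sort_type="created_utc")
print(url)
```

The resulting string could be passed to `requests.get` in place of the hard-coded URLs, with one function serving both subreddits.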

In [4]:
subreddit_amd = []
title_amd = []
selftext_amd =[]
subreddit_build_pc = []
title_build_pc = []
selftext_build_pc =[]

<a name="scraping"></a>
- [Back to Contents](#contents)
## Web Scraping

In [11]:
# Loop that requests a page of posts from each subreddit, up to 5,000 times
# Inspired by this source: https://www.textjuicer.com/2019/07/crawling-all-submissions-from-a-subreddit/
# Counting down on each iteration: https://datatofish.com/while-loop-python/
before_amd = None    # None starts at the latest post; each iteration then pulls posts created before the oldest one pulled so far
before_build_pc = None
iterations_times_25 = 5000
while iterations_times_25 > 0:   # Each iteration pulls 25 posts per subreddit using the Pushshift API
    time.sleep(1.5)  # Pause between pulls so requests are not sent to the API too quickly
    amd_submissions = submission_request_amd(before = before_amd, sort='desc', sort_type='created_utc')
    build_pc_submissions = submission_request_build_pc(before = before_build_pc, sort='desc', sort_type='created_utc')

    # Skip posts that raise an error and continue with the loop:
    # https://stackoverflow.com/questions/38707513/ignoring-an-error-message-to-continue-with-the-loop-in-python

    # AMD
    for xi in amd_submissions:
        try:
            before_amd = xi['created_utc']
            selftext_amd.append(xi['selftext'])
            subreddit_amd.append(xi['subreddit'])
            title_amd.append(xi['title'])
        except KeyError:
            # The post is missing one of the fields, so skip it
            pass
    # Build PC
    for n in build_pc_submissions:
        try:
            before_build_pc = n['created_utc']
            selftext_build_pc.append(n['selftext'])
            subreddit_build_pc.append(n['subreddit'])
            title_build_pc.append(n['title'])
        except KeyError:
            # The post is missing one of the fields, so skip it
            pass
    iterations_times_25 -= 1  # Count down until iterations_times_25 reaches 0
    time.sleep(1.5)  # Additional pause before the next round of requests


ConnectionError: HTTPSConnectionPool(host='api.pushshift.io', port=443): Max retries exceeded with url: /reddit/search/submission/?subreddit=buildapc&before=1596254766&sort=desc&sort_type=created_utc (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f802fc1b370>: Failed to establish a new connection: [Errno -2] Name or service not known'))
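
The `ConnectionError` above ended the loop before all 5,000 iterations finished. One way to make a long-running pull more tolerant of brief network failures is to retry a failed call a few times before giving up. The following is a minimal sketch, not the approach used in this notebook. `call_with_retries` and `flaky_fetch` are hypothetical names, and the failure is simulated so the example runs offline. The default of catching `OSError` is an assumption chosen because the exceptions raised by `requests` derive from it.

```python
import time


def call_with_retries(fetch, retries=3, wait=1.5, exceptions=(OSError,)):
    """Call fetch() and retry on the given exceptions (hypothetical helper)."""
    for attempt in range(1, retries + 1):
        try:
            return fetch()
        except exceptions:
            if attempt == retries:
                raise  # Out of attempts: let the error propagate
            time.sleep(wait * attempt)  # Wait a little longer after each failure


# Simulated request that fails twice and then succeeds
calls = {"count": 0}


def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated network failure")
    return [{"title": "example post"}]


posts = call_with_retries(flaky_fetch, retries=5, wait=0)
print(calls["count"], posts)
```

In the scraping loop, `fetch` could be a `lambda` wrapping one of the request functions, so that a single dropped connection does not stop the whole run.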

In [None]:
# A connection error occurred, but I was able to pull over 100,000 posts from each subreddit
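
The scraping loop pages backwards through time: each pull asks for posts created before the oldest post already collected. That idea can be isolated as a small function, sketched below under stated assumptions. `collect_submissions`, `fake_fetch_page`, and `FAKE_POSTS` are hypothetical names, and the data is made up so the example runs without calling the API. Stopping on an empty page is an addition that the original loop does not have.

```python
def collect_submissions(fetch_page, max_pages):
    """Page backwards through posts using the oldest timestamp seen so far."""
    posts = []
    before = None  # None means "start from the latest post"
    for _ in range(max_pages):
        page = fetch_page(before)
        if not page:
            break  # Nothing older is left, so stop early
        posts.extend(page)
        # Pages are sorted newest first, so the last post is the oldest
        before = page[-1]["created_utc"]
    return posts


# Made-up posts with timestamps 10 down to 1, newest first
FAKE_POSTS = [{"created_utc": t, "title": f"post {t}"} for t in range(10, 0, -1)]


def fake_fetch_page(before, size=4):
    """Stand-in for an API call: return up to `size` posts older than `before`."""
    older = [p for p in FAKE_POSTS if before is None or p["created_utc"] < before]
    return older[:size]


posts = collect_submissions(fake_fetch_page, max_pages=5000)
print(len(posts))
```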

In [14]:
print(len(subreddit_amd))
print(len(title_amd))
print(len(selftext_amd))

109740
109740
109740


In [15]:
print(len(subreddit_build_pc))
print(len(title_build_pc))
print(len(selftext_build_pc))

110325
110325
110325


<a name="dataframes"></a>
- [Back to Contents](#contents)
## Create Dataframes

In [16]:
# Create dataframes with columns showing the subreddit, title and selftext
data_amd = pd.DataFrame({'subreddit': subreddit_amd,
                        'title': title_amd,
                        'selftext': selftext_amd
                       })

In [17]:
data_amd.head()

Unnamed: 0,subreddit,title,selftext
0,Amd,Low VRAM Issues?,[removed]
1,Amd,[Level1Techs] 32 Core Threadripper Workstation...,
2,Amd,Decent cheap AMD prebuilt systems? Canada,Trying to see if there's anything worth recomm...
3,Amd,The new drivers don't wanna download,I have been trying to update my drivers from t...
4,Amd,aorus redemption I on b550m,"last time, i said only gigabyte has not used d..."


In [18]:
data_amd.shape

(109740, 3)

In [19]:
data_build_pc = pd.DataFrame({'subreddit': subreddit_build_pc,
                        'title': title_build_pc,
                        'selftext': selftext_build_pc
                       })

In [20]:
data_build_pc.head()

Unnamed: 0,subreddit,title,selftext
0,buildapc,Windows 10,Hey guys just finished my build not able to ge...
1,buildapc,Is there a somewhat reasonably priced GPU that...,I have been rocking a Vega blower for a couple...
2,buildapc,How's This Budget Build?,Mobo: ASUS PRIMW B450M-A $109\n\nCPU: Ryzen 3...
3,buildapc,New case and liquid cooling,So I’m thinking about buying a new case and ad...
4,buildapc,Quick questions on possible GPUs,"I've currently got my Motherboard mATX, 3X 120..."


In [21]:
data_build_pc.shape

(110325, 3)

<a name="save"></a>
- [Back to Contents](#contents)
## Save to File

In [22]:
data_amd.to_csv('./Data/amd_big.csv')

In [23]:
data_build_pc.to_csv('./Data/build_pc_big.csv')
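
By default, `to_csv` also writes the dataframe index as an unnamed first column, which comes back as `Unnamed: 0` when the file is read again with `read_csv`. The sketch below shows the difference on a small made-up dataframe, writing to in-memory buffers instead of the `./Data` folder so it has no side effects. Whether to keep the index is a design choice. Passing `index=False` is one option, not something the original notebook does.

```python
import io

import pandas as pd

# Small made-up dataframe with the same columns as the scraped data
df = pd.DataFrame({
    "subreddit": ["Amd", "Amd"],
    "title": ["first title", "second title"],
    "selftext": ["first text", "second text"],
})

# Default behaviour: the index is written as an extra, unnamed column
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False leaves the index out of the file
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```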