## Week 1: Webscraping Demo with Beautiful Soup 

### First, let's build a simple one-column DataFrame with Beautiful Soup

Note: Beautiful Soup is used here for demo purposes; we will be using Selenium or Scrapy in our project.

In [19]:
# imports and environment setup 
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup 

In [20]:
URL = 'https://community.mypaint.org/'
page = requests.get(URL, headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(page.content, 'html.parser')

#### Let's inspect the page and find the data we want 

We are looking for the tags that hold the topic titles.

In [21]:
# Table tag 
table = soup.find_all('table')

In [22]:
len(table)

1

In [23]:
# Table row tag 
rows = table[0].find_all('tr')

In [24]:
len(rows) # 31 rows: 1 header row + 30 topics

31

In [25]:
rows[0] # headers 

<tr>
<th>Topic</th>
<th></th>
<th>Replies</th>
<th>Views</th>
<th>Activity</th>
</tr>

In [26]:
# rows[30]

In [27]:
# Within each row tag 
span = rows[1].find_all('span')

In [28]:
span[0]

<span class="link-top-line">
<a class="title raw-link raw-topic-link" href="https://community.mypaint.org/t/welcome-to-the-mypaint-community/8">Welcome to the MyPaint Community!</a>
</span>

In [29]:
# Within each span tag 
a = span[0].find_all('a')

In [30]:
a[0].text

'Welcome to the MyPaint Community!'

#### Let's make a for loop 

Store all titles in a list with a for loop.

Later, if we want more than one column, we can use a dictionary.

In [32]:
titles = []

# Start at index 1 to skip the header row, which has no title
for i in range(1, len(rows)):

    # Find the 'span' tags in each row
    span = rows[i].find_all('span')

    # The first span holds the 'a' tag whose text is the title
    a = span[0].find_all('a')
    titles.append(a[0].text)

In [33]:
# We have the first 30 titles stored 
len(titles)

30
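The loop above assumes every row has a `span` with a link inside it, so a row without one would raise an `IndexError`. As a minimal sketch of a more defensive variant, the function below uses a CSS selector and skips rows with no title link. The sample HTML is a hand-written stand-in modeled on the tags shown earlier (the `td` wrappers and the link-less row are assumptions), and `extract_titles` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the forum table; the <td> wrappers are an assumption.
SAMPLE_HTML = """
<table>
<tr><th>Topic</th><th></th><th>Replies</th><th>Views</th><th>Activity</th></tr>
<tr><td><span class="link-top-line">
<a class="title raw-link raw-topic-link" href="https://community.mypaint.org/t/welcome-to-the-mypaint-community/8">Welcome to the MyPaint Community!</a>
</span></td></tr>
<tr><td>row without a title link</td></tr>
</table>
"""


def extract_titles(html):
    """Return the text of every title link found inside a table row."""
    soup = BeautifulSoup(html, "html.parser")
    titles = []
    for row in soup.find_all("tr"):
        # select_one returns None when the row has no matching link
        link = row.select_one("span a.title")
        if link is None:
            continue  # header rows and link-less rows are skipped
        titles.append(link.get_text(strip=True))
    return titles


print(extract_titles(SAMPLE_HTML))
```

Because header rows simply have no matching link, this version does not need to start the loop at index 1.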

In [34]:
# Make Pandas dataframe 
dataframe = pd.DataFrame(titles)

In [39]:
# Print the first 30 rows of the DataFrame (all of the titles we scraped)
dataframe.head(30)

| | 0 |
|---|---|
| 0 | Welcome to the MyPaint Community! |
| 1 | New Developer Looking For Some Code Pointers |
| 2 | MyPaint v2.0.1 Released |
| 3 | MyPaint v2.0.0 Released |
| 4 | Add Selection Tools |
| 5 | When will MyPaint 2.0.0 packaged into reposito... |
| 6 | Any way to get this effect? |
| 7 | MyPaint v2.0.0 Postponed |
| 8 | Disable checkboard during Layer Solo mode? |
| 9 | Christmas for Brothers and Sister |


Now we have a one-column DataFrame of titles that we can export as a CSV file.
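A minimal sketch of the export step, assuming we name the column `title` and write to a file called `mypaint_titles.csv` (both names are assumptions, not part of the demo above). It uses three of the scraped titles so that it runs on its own.

```python
import pandas as pd

# A few titles from the output above; the column name "title" is an assumption.
sample_titles = [
    "Welcome to the MyPaint Community!",
    "New Developer Looking For Some Code Pointers",
    "MyPaint v2.0.1 Released",
]
titles_df = pd.DataFrame({"title": sample_titles})

# index=False keeps the 0, 1, 2, ... row labels out of the file.
csv_path = "mypaint_titles.csv"  # hypothetical file name
titles_df.to_csv(csv_path, index=False)

# Read the file back to confirm the round trip worked.
reloaded = pd.read_csv(csv_path)
print(reloaded.shape)
```

Naming the column up front avoids the default `0` header, which is easy to lose track of once more columns are added.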

This is the bare minimum, so we want to build on this dataset.

### Next steps: 
#### Here you guys will get to start figuring things out and building your skills!

#### There are more than 30 posts (rows) existing on this forum 
We want more rows (we can decide how many specifically) 

Trade-off: more rows take longer to train on, while too few rows do not give us enough data to gain meaningful insights.

#### We don't have separate pages
We have a dynamically loading page with continuous scrolling
- there are no separate URLs per page; more topics are loaded through an API function call instead

#### Proposed solutions 
1: Go into the page's JavaScript and figure out the API call used for dynamic loading, so we can make the same call in our code

2: Easier: use Selenium, which lets Python control a web browser, to scroll down and read the page over and over again

-> Idea: wrap the scraping code in another for or while loop (one pass per scroll) and change the syntax to Selenium
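The Selenium idea can be sketched roughly as follows. This is only an outline under assumptions: `scroll_until_stable` and `fetch_scrolled_page` are hypothetical helper names, the scroll limit and pause length are arbitrary, and the stop condition (page height no longer growing) is one common heuristic rather than something confirmed for this forum.

```python
import time


def scroll_until_stable(driver, max_scrolls=10, pause_seconds=2.0):
    """Scroll to the bottom repeatedly, then return the loaded page's HTML.

    `driver` is expected to be a Selenium WebDriver. Scrolling stops early
    once the page height stops growing, i.e. nothing new was loaded.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_scrolls):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause_seconds)  # give the page time to load more rows
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height
    return driver.page_source


def fetch_scrolled_page(url):
    """Usage sketch: requires the selenium package and a working Chrome setup."""
    from selenium import webdriver  # imported here so the helper above has no dependencies

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        return scroll_until_stable(driver)
    finally:
        driver.quit()  # always close the browser, even on errors
```

The returned HTML can be passed to `BeautifulSoup(html, 'html.parser')` and parsed with the same table, row, and span logic as before.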

#### Decide if we want more columns 
Available information includes the actual text in each topic and metadata (replies, activity).

One idea: use dictionaries or append columns (we must make sure the columns line up).

To use dictionaries:

- Each dictionary is a row
- Each topic is represented as a row
- The key is the column name, and the value is that row's value for the column
- Each dictionary has the same keys but different values
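The dictionary approach might look like the sketch below. The titles come from the output above, but the `replies` and `views` numbers are placeholders rather than scraped values, and the column names are assumptions.

```python
import pandas as pd

# (title, replies, views) tuples standing in for values pulled out of each row.
# The numbers are placeholders for illustration, not real forum data.
scraped_rows = [
    ("Welcome to the MyPaint Community!", 0, 0),
    ("MyPaint v2.0.1 Released", 0, 0),
]

records = []
for title, replies, views in scraped_rows:
    # One dictionary per row: same keys every time, different values.
    records.append({"title": title, "replies": replies, "views": views})

# pandas turns a list of dictionaries into one row per dictionary,
# using the keys as column names.
topics_df = pd.DataFrame(records)
print(topics_df)
```

Building one dictionary per row keeps the columns lined up automatically, because every value is stored under its column name instead of in a separate list.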

#### If we want to sort the posts in any way 
Latest versus top 
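If we collect the metadata columns, one way to get either ordering is to sort the DataFrame after scraping. A minimal sketch, assuming hypothetical `views` and `last_activity` columns and treating "top" as "most viewed"; all values below are placeholders, not scraped data.

```python
import pandas as pd

# Placeholder values for illustration only; these are not scraped numbers.
posts = pd.DataFrame(
    {
        "title": ["Topic A", "Topic B", "Topic C"],
        "views": [120, 45, 300],
        "last_activity": pd.to_datetime(["2020-03-01", "2020-03-15", "2020-02-10"]),
    }
)

# "Latest": most recent activity first.
latest = posts.sort_values("last_activity", ascending=False).reset_index(drop=True)

# "Top": most viewed first (one possible definition of "top").
top = posts.sort_values("views", ascending=False).reset_index(drop=True)

print(latest["title"].tolist())
print(top["title"].tolist())
```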

References: Webscraping: https://erikrood.com/Python_References/web_scrape.html

By: Jennifer Dong 