<img src="seekingalpha.png" style="float: left; margin: 20px; height: 55px">

# Notebook 1 - Webscraping

#### _Webscraping Long and Short Ideas on Seeking Alpha_

---
### Notebook Summary
- Data is acquired by scraping the "Long Ideas" and "Short Ideas" threads listed under Seeking Alpha's analyses tab, covering 1/1/2017 through 1/31/2018
- The Requests and Beautiful Soup libraries are used to extract the following information about each article:
     - The title of the article
     - The author of the article
     - The stock ticker that the article references
     - The time the article was posted on Seeking Alpha
     - The link to the article itself
- The Pandas library and custom functions are used to transform the webscrape results into DataFrames
- The DataFrames are saved and exported as CSV files for use throughout the remaining notebooks, specifically for cleaning (Notebook 2) and Exploratory Data Analysis (Notebook 3)
---

In [1]:
# Importing necessary libraries 
import pandas as pd
import requests
import time
from bs4 import BeautifulSoup

---
## Webscraping 
---
### _Long Ideas_

In [2]:
long_request_list = []  # Parsed pages (BeautifulSoup objects) are appended here

for i in range(48, 235):
    # range(start, stop) sets the pages being scraped (stop is excluded);
    # these pages were selected to cover the chosen dates
    url = f"https://seekingalpha.com/stock-ideas/long-ideas?page={i}"
    # Custom header and a delay between requests: polite web scraping practices
    response = requests.get(url, headers={'User-agent': '-----lauren'})
    time.sleep(30)
    print(response.status_code)
    if response.status_code == 200:  # Parsing HTML only if the request succeeded
        html = response.text
        soup = BeautifulSoup(html, 'lxml')
        long_request_list.append(soup)  # Appending parsed page to the list
    print(f'({url})')
    print(i)

200
(https://seekingalpha.com/stock-ideas/long-ideas?page=48)
48
200
(https://seekingalpha.com/stock-ideas/long-ideas?page=49)
49
200
(https://seekingalpha.com/stock-ideas/long-ideas?page=50)
50
200
(https://seekingalpha.com/stock-ideas/long-ideas?page=51)
51
200
(https://seekingalpha.com/stock-ideas/long-ideas?page=52)
52
200
(https://seekingalpha.com/stock-ideas/long-ideas?page=53)
53
200
(https://seekingalpha.com/stock-ideas/long-ideas?page=54)
54
200
(https://seekingalpha.com/stock-ideas/long-ideas?page=55)
55
200
(https://seekingalpha.com/stock-ideas/long-ideas?page=56)
56
200
(https://seekingalpha.com/stock-ideas/long-ideas?page=57)
57
200
(https://seekingalpha.com/stock-ideas/long-ideas?page=58)
58
200
(https://seekingalpha.com/stock-ideas/long-ideas?page=59)
59
200
(https://seekingalpha.com/stock-ideas/long-ideas?page=60)
60
200
(https://seekingalpha.com/stock-ideas/long-ideas?page=61)
61
200
(https://seekingalpha.com/stock-ideas/long-ideas?page=62)
62
200
(https://seekingalpha

200
(https://seekingalpha.com/stock-ideas/long-ideas?page=172)
172
200
(https://seekingalpha.com/stock-ideas/long-ideas?page=173)
173
200
(https://seekingalpha.com/stock-ideas/long-ideas?page=174)
174
200
(https://seekingalpha.com/stock-ideas/long-ideas?page=175)
175
200
(https://seekingalpha.com/stock-ideas/long-ideas?page=176)
176
200
(https://seekingalpha.com/stock-ideas/long-ideas?page=177)
177
200
(https://seekingalpha.com/stock-ideas/long-ideas?page=178)
178
200
(https://seekingalpha.com/stock-ideas/long-ideas?page=179)
179
200
(https://seekingalpha.com/stock-ideas/long-ideas?page=180)
180
200
(https://seekingalpha.com/stock-ideas/long-ideas?page=181)
181
200
(https://seekingalpha.com/stock-ideas/long-ideas?page=182)
182
200
(https://seekingalpha.com/stock-ideas/long-ideas?page=183)
183
200
(https://seekingalpha.com/stock-ideas/long-ideas?page=184)
184
200
(https://seekingalpha.com/stock-ideas/long-ideas?page=185)
185
200
(https://seekingalpha.com/stock-ideas/long-ideas?page=186)
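
The long and short scraping loops differ only in their URL and page range, and a page that does not return 200 is skipped without being recorded. A minimal sketch of one reusable helper that also keeps track of failed pages could look like the following. The names `scrape_pages`, `fetch`, and `requests_fetch`, the `{page}` URL template, and the placeholder user-agent string are assumptions for illustration and are not part of this notebook.

```python
import time


def scrape_pages(url_template, pages, fetch, delay=30, sleep=time.sleep):
    """Fetch every page, pausing after each request.

    `fetch` is a hypothetical hook: any callable that takes a URL and
    returns a (status_code, text) tuple. Bodies are kept only for HTTP 200;
    every other page number is recorded so it can be retried later.
    """
    bodies, failed = [], []
    for page in pages:
        status, text = fetch(url_template.format(page=page))
        if status == 200:
            bodies.append(text)
        else:
            failed.append(page)
        sleep(delay)  # polite delay between requests
    return bodies, failed


def requests_fetch(url):
    """Live fetcher built on Requests (imported lazily, only when used)."""
    import requests
    # Placeholder user-agent string, chosen for this sketch
    response = requests.get(url, headers={'User-agent': 'example-user-agent'})
    return response.status_code, response.text
```

Passing `requests_fetch` as `fetch` would contact the site, while a stub function allows the loop logic to be checked offline.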

### _Short Ideas_

In [3]:
short_request_list = []  # Parsed pages (BeautifulSoup objects) are appended here

for i in range(5, 28):
    # range(start, stop) sets the pages being scraped (stop is excluded);
    # these pages were selected to cover the chosen dates
    url = f"https://seekingalpha.com/stock-ideas/short-ideas?page={i}"
    # Custom header and a delay between requests: polite web scraping practices
    response = requests.get(url,
    headers = {'User-agent': 'LaurenCable-GeneralAssemblyCapstone-6787994961-laurencable10@gmail.com'})
    time.sleep(30)
    print(response.status_code)
    if response.status_code == 200:  # Parsing HTML only if the request succeeded
        html = response.text
        soup = BeautifulSoup(html, 'lxml')
        short_request_list.append(soup)  # Appending parsed page to the list
    print(f'({url})')
    print(i)

200
(https://seekingalpha.com/stock-ideas/short-ideas?page=5)
5
200
(https://seekingalpha.com/stock-ideas/short-ideas?page=6)
6
200
(https://seekingalpha.com/stock-ideas/short-ideas?page=7)
7
200
(https://seekingalpha.com/stock-ideas/short-ideas?page=8)
8
200
(https://seekingalpha.com/stock-ideas/short-ideas?page=9)
9
200
(https://seekingalpha.com/stock-ideas/short-ideas?page=10)
10
200
(https://seekingalpha.com/stock-ideas/short-ideas?page=11)
11
200
(https://seekingalpha.com/stock-ideas/short-ideas?page=12)
12
200
(https://seekingalpha.com/stock-ideas/short-ideas?page=13)
13
200
(https://seekingalpha.com/stock-ideas/short-ideas?page=14)
14
200
(https://seekingalpha.com/stock-ideas/short-ideas?page=15)
15
200
(https://seekingalpha.com/stock-ideas/short-ideas?page=16)
16
200
(https://seekingalpha.com/stock-ideas/short-ideas?page=17)
17
200
(https://seekingalpha.com/stock-ideas/short-ideas?page=18)
18
200
(https://seekingalpha.com/stock-ideas/short-ideas?page=19)
19
200
(https://seeking

---
## Extracting Desired Data
---
### _Long Ideas_

In [4]:
# Function to extract titles

def extract_longideas_titles(long_request_list):
    longideas_titles = []   # Creating empty list to append article titles to 
    for i in long_request_list:  # Iterating through scraped results 
        for each in i.find_all('a',{'class':'a-title'}): # Calling proper HTML tags/classes 
            longideas_titles.append(each.text)  # Appending text data to empty list
    return longideas_titles

# Storing the function's result in a variable
fetch_longideas_titles = extract_longideas_titles(long_request_list)
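
The title and author/ticker/time extractors in this notebook all follow the same pattern: loop over the parsed pages and collect the text of every element with a given tag and class. A minimal sketch of one parameterized helper is shown below. The function name `extract_text` and the sample markup are hypothetical; only the class names `a-title` and `a-info` come from the notebook, and the built-in `html.parser` is used here so the sketch does not depend on `lxml`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration; the real page structure may differ
SAMPLE_HTML = """
<div class="media-body">
  <a class="a-title" href="/article/1-example">Example Idea</a>
  <div class="a-info">ABC• Mon, Jan. 15, 8:57 AM • Example Author</div>
</div>
"""


def extract_text(soups, tag, css_class):
    """Return the text of every <tag class=css_class> element across all pages."""
    return [element.text
            for soup in soups
            for element in soup.find_all(tag, {'class': css_class})]


sample_soups = [BeautifulSoup(SAMPLE_HTML, 'html.parser')]
titles = extract_text(sample_soups, 'a', 'a-title')
infos = extract_text(sample_soups, 'div', 'a-info')
```

The same helper could be called with either the long or the short list of parsed pages, which avoids maintaining two copies of each function.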

In [5]:
# Function to extract links

def extract_longideas_links(long_request_list):
    longideas_links = []    # Creating empty list to append article links to 
    for i in long_request_list:  # Iterating through scraped results 
        for each in i.find_all('div',{'class':'media-body'}): # Calling proper HTML tags/classes
            result = each.find('a')
            if result:
                link = result['href']
                longideas_links.append(link) # Appending data to empty list
    return longideas_links

# Storing the function's result in a variable
fetch_longideas_links = extract_longideas_links(long_request_list)
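
The notebook does not show what the scraped `href` values look like. If they turn out to be site-relative paths (an assumption), they can be converted to full URLs with the standard library. In this sketch, `to_absolute`, `BASE_URL`, and the example paths are hypothetical names and values.

```python
from urllib.parse import urljoin

BASE_URL = 'https://seekingalpha.com'


def to_absolute(links, base_url=BASE_URL):
    """Resolve each link against the site root; absolute URLs pass through unchanged."""
    return [urljoin(base_url, link) for link in links]


# Made-up example values, one relative and one already absolute
example_links = ['/article/1-example', 'https://seekingalpha.com/article/2-example']
absolute_links = to_absolute(example_links)
```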

In [6]:
# Function to extract author, ticker, and time posted 

def extract_longideas_multi(long_request_list):
    longideas_multi = [] # Creating empty list to append requests to
    for i in long_request_list: # Iterating through requests
        for each in i.find_all('div',{'class':'a-info'}): # Calling proper HTML tags/classes
            longideas_multi.append(each.text) # Appending text data to empty list
    return longideas_multi

# Storing the function's result in a variable
fetch_longideas_multi = extract_longideas_multi(long_request_list)

# Ticker, time, and author (plus a comment count, when present) are in 1 string, separated by bullets
fetch_longideas_multi[:3]

['WFC• Mon, Jan. 15, 8:57 AM • FIG Ideas•7\xa0Comments',
 'TSLA• Mon, Jan. 15, 8:21 AM • Nick Cox•466\xa0Comments',
 'BTE• Mon, Jan. 15, 8:19 AM • Long Player•93\xa0Comments']

In [7]:
# To separate the attributes in the strings above,
# first transform the results into a DataFrame

longideas_df = pd.DataFrame({'Multi': fetch_longideas_multi})

# Using the bullet as the delimiter:
# split each string once, then extract each attribute by its position
longideas_parts = longideas_df['Multi'].str.split('•')
longideas_tickers = longideas_parts.str.get(0)
longideas_times = longideas_parts.str.get(1)
longideas_authors = longideas_parts.str.get(2)
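
Splitting on the bullet leaves the surrounding spaces attached to the time and author values, and some rows carry a fourth piece (the comment count) while others do not. A minimal sketch of an alternative that splits into columns in one step and trims the whitespace is shown below, using strings copied from the outputs in this notebook. The variable names `sample` and `parts` are illustrative.

```python
import pandas as pd

sample = pd.Series([
    'WFC• Mon, Jan. 15, 8:57 AM • FIG Ideas•7\xa0Comments',
    'TSLA• Mon, Jan. 15, 8:21 AM • Nick Cox•466\xa0Comments',
    'CROX• Thu, Jan. 18, 7:30 AM • ALT Perspective',
])

# expand=True returns one column per piece; keep only the first three,
# since the comment count is missing on some rows and is not needed here
parts = sample.str.split('•', expand=True).iloc[:, :3]
parts.columns = ['Tickers', 'Time', 'Authors']

# Trim the leading and trailing spaces left around each piece
parts = parts.apply(lambda column: column.str.strip())
```

Whether to trim here or during cleaning in Notebook 2 is a design choice; the sketch only shows that both steps fit in a few lines.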

### _Short Ideas_

In [8]:
# Function to extract titles

def extract_shortideas_titles(short_request_list):  
    shortideas_titles = []  # Creating empty list to append article titles to
    for i in short_request_list:  # Iterating through scraped results 
        for each in i.find_all('a',{'class':'a-title'}): # Calling proper HTML tags/classes   
            shortideas_titles.append(each.text) # Appending text data to empty list
    return shortideas_titles

# Storing the function's result in a variable
fetch_shortideas_titles = extract_shortideas_titles(short_request_list)

In [9]:
# Function to extract links

def extract_shortideas_links(short_request_list):
    shortideas_links = []   # Creating empty list to append article links to
    for i in short_request_list:  # Iterating through scraped results
        for each in i.find_all('div',{'class':'media-body'}): # Calling proper HTML tags/classes 
            result = each.find('a')
            if result:
                link = result['href']
                shortideas_links.append(link) # Appending data to empty list 
    return shortideas_links

# Storing the function's result in a variable
fetch_shortideas_links = extract_shortideas_links(short_request_list)

In [10]:
# Function to extract author, ticker, and time posted 

def extract_shortideas_multi(short_request_list):
    shortideas_multi = [] # Creating empty list to append requests to
    for i in short_request_list: # Iterating through requests
        for each in i.find_all('div',{'class':'a-info'}): # Calling proper HTML tags/classes
            shortideas_multi.append(each.text) # Appending text data to empty list
    return shortideas_multi

# Storing the function's result in a variable
fetch_shortideas_multi = extract_shortideas_multi(short_request_list)

# Ticker, time, and author (plus a comment count, when present) are in 1 string, separated by bullets
fetch_shortideas_multi[:3]

['CROX• Thu, Jan. 18, 7:30 AM • ALT Perspective',
 'TSLA• Wed, Jan. 17, 5:09 PM • Bill Maurer•410\xa0Comments',
 'HAWK• Wed, Jan. 17, 4:58 PM • Friedrich Chen•6\xa0Comments']

In [11]:
# To separate the attributes in the strings above,
# first transform the results into a DataFrame

shortideas_df = pd.DataFrame({'Multi': fetch_shortideas_multi})

# Using the bullet as the delimiter:
# split each string once, then extract each attribute by its position
shortideas_parts = shortideas_df['Multi'].str.split('•')
shortideas_tickers = shortideas_parts.str.get(0)
shortideas_times = shortideas_parts.str.get(1)
shortideas_authors = shortideas_parts.str.get(2)

---
## Transforming Results into "Ideas" DataFrames
---
### _Long Ideas_

In [12]:
# Combining results into a DataFrame

longideas = pd.DataFrame({'Time': longideas_times, 'Link': fetch_longideas_links,
                          'Title': fetch_longideas_titles, 'Authors': longideas_authors,
                          'Tickers': longideas_tickers})

In [13]:
longideas['Strategy'] = 'Long' # Adding column for strategy type 
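
The titles, links, and author/ticker/time strings are collected by three separate passes over the pages, so nothing guarantees that the three results contain the same number of items. A minimal sketch of a length check that could run before building a DataFrame is shown below. The helper name `check_equal_lengths` and the sample values are hypothetical, and the check catches only mismatched counts, not items that are present but out of order.

```python
def check_equal_lengths(**columns):
    """Raise ValueError if the named sequences do not all have the same length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f'Column lengths differ: {lengths}')
    return lengths


# Made-up values standing in for the scraped lists
lengths = check_equal_lengths(Title=['a', 'b'], Link=['/x', '/y'], Time=['t1', 't2'])
```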

### _Short Ideas_

In [14]:
# Combining results into a DataFrame

shortideas = pd.DataFrame({'Time': shortideas_times, 'Link': fetch_shortideas_links,
                           'Title': fetch_shortideas_titles, 'Authors': shortideas_authors,
                           'Tickers': shortideas_tickers})

In [15]:
shortideas['Strategy'] = 'Short' # Adding column for strategy type 

### _Ideas DataFrame_

In [19]:
ideas = pd.concat([longideas, shortideas])  # Concatenating long and short ideas
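
By default `pd.concat` keeps each frame's original row labels, so the combined frame repeats index values. That has no effect on a CSV written with `index=False`, but it matters for any label-based selection done before saving. The sketch below illustrates the difference on made-up frames; `long_sample`, `short_sample`, `kept_index`, and `fresh_index` are illustrative names.

```python
import pandas as pd

long_sample = pd.DataFrame({'Tickers': ['WFC', 'TSLA'], 'Strategy': 'Long'})
short_sample = pd.DataFrame({'Tickers': ['CROX'], 'Strategy': 'Short'})

# Default behaviour: row labels from both frames are kept as they were
kept_index = pd.concat([long_sample, short_sample])

# ignore_index=True assigns a fresh 0..n-1 index to the combined frame
fresh_index = pd.concat([long_sample, short_sample], ignore_index=True)
```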

---
## Saving Results 
---

In [20]:
# Exporting Ideas dataframe as csv
ideas.to_csv('ideas_df.csv',index=False)
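
Since later notebooks depend on this file, it can be worth confirming that the export reads back as expected. The sketch below shows a round trip on made-up data, with an in-memory buffer standing in for `ideas_df.csv`; the names `sample`, `buffer`, and `restored` are illustrative.

```python
import io

import pandas as pd

# Made-up rows using a subset of the notebook's column names
sample = pd.DataFrame({
    'Title': ['Example long idea', 'Example short idea'],
    'Tickers': ['WFC', 'CROX'],
    'Strategy': ['Long', 'Short'],
})

buffer = io.StringIO()
sample.to_csv(buffer, index=False)  # same call as above, written to memory
buffer.seek(0)                      # rewind before reading
restored = pd.read_csv(buffer)
```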

## Onwards!

---

## Please proceed to Notebook 2 :)  