<img src="seekingalpha.png" style="float: left; margin: 20px; height: 55px">

# Notebook 1 - Webscraping

#### _Webscraping Long and Short Ideas on Seeking Alpha_

---
### Notebook Summary
- Data will be acquired by scraping the "Long Ideas" and "Short Ideas" threads listed under Seeking Alpha's analysis tab, covering 1/1/2017 through 1/31/2018
- The Requests and Beautiful Soup libraries will be used to extract the following information about each article:
     - The title of the article
     - The author of the article
     - The stock ticker that the article references
     - The date and time the article was posted on Seeking Alpha
     - The link to the article itself
- The Pandas library and custom functions will be used to transform the webscrape results into DataFrames
- The DataFrames will be combined and exported as a CSV file for use in the remaining notebooks, specifically for cleaning (Notebook 2) and Exploratory Data Analysis (Notebook 3)
---

In [22]:
# Importing necessary libraries 
import pandas as pd
import requests
import time
from bs4 import BeautifulSoup

---
## Webscraping 
---
### _Long Ideas_

In [25]:
long_request_list = []  # Creating empty list to append results to 

for i in range(390, 657):
    # range(390, 657) covers pages 390 through 656, chosen to span the target dates
    url = f'https://seekingalpha.com/stock-ideas/long-ideas?page={i}'
    # Custom User-agent header and a delay between requests: polite web scraping practices
    response = requests.get(
        url, headers={'User-agent': '6787994961-laurencable-laurencable10@gmail.com'})
    time.sleep(30)
    print(response.status_code)
    if response.status_code == 200:  # Parsing the HTML only if the request succeeded
        html = response.text
        soup = BeautifulSoup(html, 'lxml')
        long_request_list.append(soup)  # Appending the parsed page to the list
    print(f'({url})')
    print(i)

200
(https://seekingalpha.com/stock-ideas/long-ideas?page=390)
390
200
(https://seekingalpha.com/stock-ideas/long-ideas?page=391)
391
200
(https://seekingalpha.com/stock-ideas/long-ideas?page=392)
392
200
(https://seekingalpha.com/stock-ideas/long-ideas?page=393)
393
200
(https://seekingalpha.com/stock-ideas/long-ideas?page=394)
394
200
(https://seekingalpha.com/stock-ideas/long-ideas?page=395)
395
200
(https://seekingalpha.com/stock-ideas/long-ideas?page=396)
396
200
(https://seekingalpha.com/stock-ideas/long-ideas?page=397)
397
200
(https://seekingalpha.com/stock-ideas/long-ideas?page=398)
398
200
(https://seekingalpha.com/stock-ideas/long-ideas?page=399)
399
200
(https://seekingalpha.com/stock-ideas/long-ideas?page=400)
400
200
(https://seekingalpha.com/stock-ideas/long-ideas?page=401)
401
200
(https://seekingalpha.com/stock-ideas/long-ideas?page=402)
402
200
(https://seekingalpha.com/stock-ideas/long-ideas?page=403)
403
200
(https://seekingalpha.com/stock-ideas/long-ideas?page=404)

200
(https://seekingalpha.com/stock-ideas/long-ideas?page=513)
513
200
(https://seekingalpha.com/stock-ideas/long-ideas?page=514)
514
200
(https://seekingalpha.com/stock-ideas/long-ideas?page=515)
515
200
(https://seekingalpha.com/stock-ideas/long-ideas?page=516)
516
200
(https://seekingalpha.com/stock-ideas/long-ideas?page=517)
517
200
(https://seekingalpha.com/stock-ideas/long-ideas?page=518)
518
200
(https://seekingalpha.com/stock-ideas/long-ideas?page=519)
519
200
(https://seekingalpha.com/stock-ideas/long-ideas?page=520)
520
200
(https://seekingalpha.com/stock-ideas/long-ideas?page=521)
521
200
(https://seekingalpha.com/stock-ideas/long-ideas?page=522)
522
200
(https://seekingalpha.com/stock-ideas/long-ideas?page=523)
523
200
(https://seekingalpha.com/stock-ideas/long-ideas?page=524)
524
200
(https://seekingalpha.com/stock-ideas/long-ideas?page=525)
525
200
(https://seekingalpha.com/stock-ideas/long-ideas?page=526)
526
200
(https://seekingalpha.com/stock-ideas/long-ideas?page=527)

200
(https://seekingalpha.com/stock-ideas/long-ideas?page=636)
636
200
(https://seekingalpha.com/stock-ideas/long-ideas?page=637)
637
200
(https://seekingalpha.com/stock-ideas/long-ideas?page=638)
638
200
(https://seekingalpha.com/stock-ideas/long-ideas?page=639)
639
200
(https://seekingalpha.com/stock-ideas/long-ideas?page=640)
640
200
(https://seekingalpha.com/stock-ideas/long-ideas?page=641)
641
200
(https://seekingalpha.com/stock-ideas/long-ideas?page=642)
642
200
(https://seekingalpha.com/stock-ideas/long-ideas?page=643)
643
200
(https://seekingalpha.com/stock-ideas/long-ideas?page=644)
644
200
(https://seekingalpha.com/stock-ideas/long-ideas?page=645)
645
200
(https://seekingalpha.com/stock-ideas/long-ideas?page=646)
646
200
(https://seekingalpha.com/stock-ideas/long-ideas?page=647)
647
200
(https://seekingalpha.com/stock-ideas/long-ideas?page=648)
648
200
(https://seekingalpha.com/stock-ideas/long-ideas?page=649)
649
200
(https://seekingalpha.com/stock-ideas/long-ideas?page=650)
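
The request loop above is repeated almost verbatim for the short ideas, so it could be factored into one helper. The sketch below is only an illustration under stated assumptions: `scrape_pages`, `fetch`, `user_agent` and `delay_seconds` are hypothetical names that do not appear in this notebook, and `fetch` stands for any callable shaped like `requests.get` (it takes a URL plus a `headers` keyword and returns an object with `status_code` and `text`).

```python
import time


def scrape_pages(base_url, pages, fetch, user_agent, delay_seconds=30):
    """Fetch each page politely and return (page, html) pairs for 200 responses.

    `fetch` is a hypothetical parameter: any callable shaped like
    `requests.get`, accepting a URL and a `headers` keyword and returning
    an object that exposes `status_code` and `text`.
    """
    results = []
    headers = {'User-agent': user_agent}
    for page in pages:
        url = f'{base_url}?page={page}'
        response = fetch(url, headers=headers)
        print(response.status_code, url)
        if response.status_code == 200:  # Keep only successful responses
            results.append((page, response.text))
        time.sleep(delay_seconds)  # Pause between requests
    return results
```

Passing `requests.get` as `fetch`, with `range(390, 657)` as `pages`, would mirror the loop above; each returned HTML string could then be handed to `BeautifulSoup(html, 'lxml')`. Keeping the page number next to the HTML makes it easier to see which pages failed.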

### _Short Ideas_

In [26]:
short_request_list = []  # Creating empty list to append soup results to 

for i in range(53, 90):
    # range(53, 90) covers pages 53 through 89, chosen to span the target dates
    url = f'https://seekingalpha.com/stock-ideas/short-ideas?page={i}'
    # Custom User-agent header and a delay between requests: polite web scraping practices
    response = requests.get(url, headers={'User-agent': 'capstone-laurencable-dsi-ga'})
    time.sleep(30)
    print(response.status_code)
    if response.status_code == 200:  # Parsing the HTML only if the request succeeded
        html = response.text
        soup = BeautifulSoup(html, 'lxml')
        short_request_list.append(soup)  # Appending the parsed page to the list
    print(f'({url})')
    print(i)

200
(https://seekingalpha.com/stock-ideas/short-ideas?page=53)
53
200
(https://seekingalpha.com/stock-ideas/short-ideas?page=54)
54
200
(https://seekingalpha.com/stock-ideas/short-ideas?page=55)
55
200
(https://seekingalpha.com/stock-ideas/short-ideas?page=56)
56
200
(https://seekingalpha.com/stock-ideas/short-ideas?page=57)
57
200
(https://seekingalpha.com/stock-ideas/short-ideas?page=58)
58
200
(https://seekingalpha.com/stock-ideas/short-ideas?page=59)
59
200
(https://seekingalpha.com/stock-ideas/short-ideas?page=60)
60
200
(https://seekingalpha.com/stock-ideas/short-ideas?page=61)
61
200
(https://seekingalpha.com/stock-ideas/short-ideas?page=62)
62
200
(https://seekingalpha.com/stock-ideas/short-ideas?page=63)
63
200
(https://seekingalpha.com/stock-ideas/short-ideas?page=64)
64
200
(https://seekingalpha.com/stock-ideas/short-ideas?page=65)
65
200
(https://seekingalpha.com/stock-ideas/short-ideas?page=66)
66
200
(https://seekingalpha.com/stock-ideas/short-ideas?page=67)
67
200
(https

---
## Extracting Desired Data
---
### _Long Ideas_

In [27]:
# Function to extract titles

def extract_longideas_titles(long_request_list):
    longideas_titles = []  # Empty list to collect article titles
    for soup in long_request_list:  # Iterating through parsed pages
        for each in soup.find_all('a', {'class': 'a-title'}):  # Title anchors
            longideas_titles.append(each.text)  # Appending the title text
    return longideas_titles

# Storing the function's result in a variable
fetch_longideas_titles = extract_longideas_titles(long_request_list)

In [28]:
# Function to extract links

def extract_longideas_links(long_request_list):
    longideas_links = []  # Empty list to collect article links
    for soup in long_request_list:  # Iterating through parsed pages
        for each in soup.find_all('div', {'class': 'media-body'}):  # Article containers
            result = each.find('a')  # First anchor inside the container
            if result:
                link = result['href']
                longideas_links.append(link)  # Appending the link
    return longideas_links

# Storing the function's result in a variable
fetch_longideas_links = extract_longideas_links(long_request_list)
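
The scraped `href` values are site-relative paths (they begin with `/article/`), so they need the site root before they can be requested directly. A minimal sketch of that conversion follows; `BASE_URL` and `to_absolute_links` are hypothetical names introduced for illustration, and the sample path is made up.

```python
from urllib.parse import urljoin

BASE_URL = 'https://seekingalpha.com'  # Site root used for the requests above


def to_absolute_links(relative_links, base_url=BASE_URL):
    """Turn scraped hrefs such as '/article/123-some-slug' into full URLs."""
    return [urljoin(base_url, link) for link in relative_links]


# Hypothetical href, for illustration only
print(to_absolute_links(['/article/123-some-slug']))
# → ['https://seekingalpha.com/article/123-some-slug']
```

`urljoin` leaves links that are already absolute unchanged, so the helper is safe to apply to a mixed list.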

In [29]:
# Function to extract author, ticker, and time posted

def extract_longideas_multi(long_request_list):
    longideas_multi = []  # Empty list to collect the info strings
    for soup in long_request_list:  # Iterating through parsed pages
        for each in soup.find_all('div', {'class': 'a-info'}):  # Article info blocks
            longideas_multi.append(each.text)  # Appending the block's full text
    return longideas_multi

# Storing the function's result in a variable
fetch_longideas_multi = extract_longideas_multi(long_request_list)

# Text for all 3 features in 1 string, separated by bullets
fetch_longideas_multi[:3]

['DLNG• Apr. 4, 2016, 11:12 AM • Dirk Leach•55\xa0Comments',
 'BAM• Apr. 4, 2016, 11:07 AM • Eric Sprague•34\xa0Comments',
 'KKD• Apr. 4, 2016, 10:46 AM • Adelphi Venture Capital']

In [30]:
# Separating the attributes combined in each string above:
# first, transforming the results into a DataFrame

longideas_df = pd.DataFrame({'Multi': fetch_longideas_multi})

# Using the bullet as delimiter: str.split, then extract each attribute by position
longideas_tickers = longideas_df['Multi'].str.split('•').str.get(0)
longideas_dates = longideas_df['Multi'].str.split('•').str.get(1)
longideas_authors = longideas_df['Multi'].str.split('•').str.get(2)
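
To see what the split does to a single string, the same logic can be written in plain Python and run on the first sample from the output above. This is only a sketch: `split_info` is a hypothetical helper, and it assumes every string follows the ticker, timestamp, author order shown in the samples.

```python
def split_info(info):
    """Split one 'a-info' string into (ticker, date, author).

    Assumes the bullet-separated layout shown in the sample output:
    ticker(s), then timestamp, then author, then an optional comment count.
    """
    parts = [part.strip() for part in info.split('•')]
    ticker, date, author = parts[0], parts[1], parts[2]
    return ticker, date, author


sample = 'DLNG• Apr. 4, 2016, 11:12 AM • Dirk Leach•55\xa0Comments'
print(split_info(sample))
# → ('DLNG', 'Apr. 4, 2016, 11:12 AM', 'Dirk Leach')
```

Splitting alone leaves the spaces that surround each bullet, which is why the sketch strips every part; on a pandas Series the equivalent step would be a trailing `.str.strip()`.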

### _Short Ideas_

In [31]:
# Function to extract titles

def extract_shortideas_titles(short_request_list):
    shortideas_titles = []  # Empty list to collect article titles
    for soup in short_request_list:  # Iterating through parsed pages
        for each in soup.find_all('a', {'class': 'a-title'}):  # Title anchors
            shortideas_titles.append(each.text)  # Appending the title text
    return shortideas_titles

# Storing the function's result in a variable
fetch_shortideas_titles = extract_shortideas_titles(short_request_list)

In [32]:
# Function to extract links

def extract_shortideas_links(short_request_list):
    shortideas_links = []  # Empty list to collect article links
    for soup in short_request_list:  # Iterating through parsed pages
        for each in soup.find_all('div', {'class': 'media-body'}):  # Article containers
            result = each.find('a')  # First anchor inside the container
            if result:
                link = result['href']
                shortideas_links.append(link)  # Appending the link
    return shortideas_links

# Storing the function's result in a variable
fetch_shortideas_links = extract_shortideas_links(short_request_list)

In [33]:
# Function to extract author, ticker, and time posted

def extract_shortideas_multi(short_request_list):
    shortideas_multi = []  # Empty list to collect the info strings
    for soup in short_request_list:  # Iterating through parsed pages
        for each in soup.find_all('div', {'class': 'a-info'}):  # Article info blocks
            shortideas_multi.append(each.text)  # Appending the block's full text
    return shortideas_multi

# Storing the function's result in a variable
fetch_shortideas_multi = extract_shortideas_multi(short_request_list)

# Text for all 3 features in 1 string, separated by bullets
fetch_shortideas_multi[:3]

['DVDCF, DVDCY• Apr. 4, 2016, 7:59 AM • The Investment Doctor•3\xa0Comments',
 'MNKD• Apr. 4, 2016, 7:07 AM • DoconStocks•31\xa0Comments',
 'FCX• Apr. 4, 2016, 5:12 AM • Colorado Wealth Management Fund•21\xa0Comments']

In [34]:
# Separating the attributes combined in each string above:
# first, transforming the results into a DataFrame

shortideas_df = pd.DataFrame({'Multi': fetch_shortideas_multi})

# Using the bullet as delimiter: str.split, then extract each attribute by position

shortideas_tickers = shortideas_df['Multi'].str.split('•').str.get(0)
shortideas_dates = shortideas_df['Multi'].str.split('•').str.get(1)
shortideas_authors = shortideas_df['Multi'].str.split('•').str.get(2)
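
The long and short extraction functions differ only in the list they receive, so one parameterized helper could serve both. The sketch below assumes exactly that; `extract_text` is a hypothetical name, and the HTML snippet is invented for illustration rather than copied from Seeking Alpha.

```python
from bs4 import BeautifulSoup


def extract_text(soups, tag, css_class):
    """Collect the text of every <tag class=css_class> across parsed pages."""
    return [
        element.text
        for soup in soups
        for element in soup.find_all(tag, {'class': css_class})
    ]


# Hypothetical markup, for illustration only; the real pages are more complex
html = '''
<div class="media-body">
  <a class="a-title" href="/article/1-first">First title</a>
  <div class="a-info">AAA• Jan. 1, 2017, 9:00 AM • Some Author</div>
</div>
'''
soups = [BeautifulSoup(html, 'html.parser')]  # Built-in parser, no lxml needed

print(extract_text(soups, 'a', 'a-title'))   # → ['First title']
print(extract_text(soups, 'div', 'a-info'))
```

With such a helper, `extract_text(long_request_list, 'a', 'a-title')` and `extract_text(short_request_list, 'a', 'a-title')` would stand in for the two title functions.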

---
## Transforming Results into "Ideas" DataFrames
---
### _Long Ideas_

In [36]:
# Combining the extracted results into a DataFrame

longideas = pd.DataFrame({'Date': longideas_dates,'Link': fetch_longideas_links,
                          'Title':fetch_longideas_titles,'Authors': longideas_authors,
                          'Tickers': longideas_tickers})

In [38]:
longideas['Strategy'] = 'Long' # Adding column for strategy type 

### _Short Ideas_

In [37]:
# Combining the extracted results into a DataFrame

shortideas = pd.DataFrame({'Date': shortideas_dates,'Link': fetch_shortideas_links,
                          'Title': fetch_shortideas_titles,'Authors': shortideas_authors,
                           'Tickers': shortideas_tickers})

In [39]:
shortideas['Strategy'] = 'Short' # Adding column for strategy type 
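
Because titles, links, and the ticker/date/author strings are each collected by a separate function, the rows of these DataFrames line up only if every selector matched once per article. Note that the links function skips containers without an anchor, which could shorten that list. Equal lengths are a necessary (though not sufficient) condition for alignment, and a small check before building the DataFrame could catch a mismatch early. `check_equal_lengths` below is a hypothetical helper, sketched with made-up values.

```python
def check_equal_lengths(**columns):
    """Raise ValueError if the named sequences differ in length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f'Column lengths differ: {lengths}')
    return lengths


# Hypothetical values, for illustration only
print(check_equal_lengths(Title=['a', 'b'], Link=['/x', '/y'], Tickers=['AAA', 'BBB']))
# → {'Title': 2, 'Link': 2, 'Tickers': 2}
```

In this notebook the call might look like `check_equal_lengths(Title=fetch_longideas_titles, Link=fetch_longideas_links, Multi=fetch_longideas_multi)`.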

### _Ideas DataFrame_

In [40]:
ideas = pd.concat([longideas,shortideas])  # Concatenating long and short ideas

In [42]:
ideas.head(5)

| | Authors | Date | Link | Tickers | Title | Strategy |
|---|---|---|---|---|---|---|
| 0 | Dirk Leach | Apr. 4, 2016, 11:12 AM | /article/3962952-lng-carrier-just-doubled-back... | DLNG | This LNG Carrier Just Doubled Its Backlog And ... | Long |
| 1 | Eric Sprague | Apr. 4, 2016, 11:07 AM | /article/3962940-2015-intrinsic-value-brookfie... | BAM | 2015 Intrinsic Value For Brookfield Asset Mana... | Long |
| 2 | Adelphi Venture Capital | Apr. 4, 2016, 10:46 AM | /article/3962944-krispy-kremes-international-e... | KKD | Krispy Kreme's International Expansion Will Be... | Long |
| 3 | Alexander Maxwell | Apr. 4, 2016, 10:44 AM | /article/3962938-eagle-pharmaceuticals-next-mo... | EGRX | Eagle Pharmaceuticals' Next Move After Surpris... | Long |
| 4 | Dan Stringer | Apr. 4, 2016, 10:00 AM | /article/3962770-surmodics-discounted-fast-tra... | SRDX | SurModics: A Discounted, Fast-Tracked Transfor... | Long |


---
## Saving Results 
---

In [41]:
# Exporting Ideas dataframe as csv
ideas.to_csv('IDEAS.csv',index=False)
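
Writing with `index=False` keeps the row index out of the file, so a later `pd.read_csv` returns only the data columns. The round trip below sketches this on two made-up rows, using an in-memory buffer in place of `IDEAS.csv`; the `ignore_index=True` argument is an optional addition not used in this notebook, shown because concatenated frames otherwise keep their original, repeating row labels.

```python
import io

import pandas as pd

# Hypothetical rows, for illustration only
long_df = pd.DataFrame({'Tickers': ['AAA'], 'Strategy': ['Long']})
short_df = pd.DataFrame({'Tickers': ['BBB'], 'Strategy': ['Short']})

# ignore_index=True renumbers rows 0..n-1 instead of repeating each frame's index
combined = pd.concat([long_df, short_df], ignore_index=True)

buffer = io.StringIO()          # In-memory stand-in for 'IDEAS.csv'
combined.to_csv(buffer, index=False)
buffer.seek(0)

reloaded = pd.read_csv(buffer)
print(list(reloaded.columns))   # → ['Tickers', 'Strategy']
print(len(reloaded))            # → 2
```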

## Onwards!

---

## Please proceed to Notebook 2 :)  