<img src="seekingalpha.png" style="float: left; margin: 20px; height: 55px">

# Notebook 1 - Webscraping Seeking Alpha

### _Webscraping Long and Short Ideas on Seeking Alpha_

---
## Notebook Summary
- Using the Requests and Beautiful Soup libraries to build webscrapers and helper functions that acquire the following data for Long and Short stock ideas on Seeking Alpha:
> - "Title" of the article
> - "Author" of the article
> - Stock "Ticker" the article references
> - "Time" at which the article was posted on Seeking Alpha
> - "Link" to the article itself

- The data is acquired by scraping the "Long Ideas" and "Short Ideas" threads listed under Seeking Alpha's analysis tab, covering articles posted from 1/1/2017 through 1/31/2018.
- Leveraging the Pandas library to transform scraped results into a DataFrame that will be exported as a CSV file to be used in subsequent notebooks.
---
### _Importing Necessary Libraries_
---

In [1]:
import pandas as pd
import requests
import time
from bs4 import BeautifulSoup

---
## Constructing Webscrapers 
---
### _Long Ideas_

In [84]:
long_request_list = []  # Creating empty list to append results to 

for i in range(210,213):  
    # range(start, stop) sets the pages being scraped (stop is exclusive)
    # Pages chosen so that the articles fall within the selected dates
    response = requests.get("https://seekingalpha.com/stock-ideas/long-ideas?page="+str(i), 
    headers = {'User-agent': '-----lauren'}) 
    # 'str(i)' inserts the page number being scraped into the URL
    # Sending a header and pausing between requests are polite web scraping practices
    time.sleep(1)  
    print(response.status_code) 
    if response.status_code == 200: # Scraping/parsing HTML only if proper connection made 
        html = response.text    
        soup = BeautifulSoup(html,'lxml') 
        long_request_list.append(soup)   # Appending results to empty list
    print(f'(https://seekingalpha.com/stock-ideas/long-ideas?page={str(i)})')
    print(i)

200
(https://seekingalpha.com/stock-ideas/long-ideas?page=210)
210
200
(https://seekingalpha.com/stock-ideas/long-ideas?page=211)
211
200
(https://seekingalpha.com/stock-ideas/long-ideas?page=212)
212


### _Short Ideas_

In [98]:
short_request_list = []  # Creating empty list to append soup results to 

for i in range(21,24):  
    # range(start, stop) sets the pages being scraped (stop is exclusive)
    # Pages chosen so that the articles fall within the selected dates
    response = requests.get("https://seekingalpha.com/stock-ideas/short-ideas?page="+str(i),
    headers = {'User-agent': 'LaurenCable-GeneralAssemblyCapstone-6787994961-laurencable10@gmail.com'})
    # 'str(i)' inserts the page number being scraped into the URL
    # Sending a header and pausing between requests are polite web scraping practices
    time.sleep(1)     
    print(response.status_code)  
    if response.status_code == 200: # Scraping/parsing HTML only if proper connection made 
        html = response.text         
        soup = BeautifulSoup(html,'lxml')  
        short_request_list.append(soup)   # Appending results to empty list
    print(f'(https://seekingalpha.com/stock-ideas/short-ideas?page={str(i)})')
    print(i)

200
(https://seekingalpha.com/stock-ideas/short-ideas?page=21)
21
200
(https://seekingalpha.com/stock-ideas/short-ideas?page=22)
22
200
(https://seekingalpha.com/stock-ideas/short-ideas?page=23)
23
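
The Long and Short scrapers above differ only in the thread name and the page range. As a minimal sketch (not part of the original notebook), the shared logic could be factored into two helpers. The names `page_urls` and `fetch_pages` are hypothetical, and the request function is passed in as `get` so the loop can be exercised without touching the network.

```python
import time

URL_TEMPLATE = "https://seekingalpha.com/stock-ideas/{thread}?page={page}"

def page_urls(thread, first_page, last_page):
    """Build the listing URLs for pages first_page..last_page (inclusive)."""
    return [URL_TEMPLATE.format(thread=thread, page=page)
            for page in range(first_page, last_page + 1)]

def fetch_pages(urls, get, headers=None, delay=1.0):
    """Request each URL with `get` and return the HTML of successful responses.

    `get` is assumed to behave like requests.get: it accepts a URL and a
    `headers` keyword and returns an object with `status_code` and `text`.
    """
    pages = []
    for url in urls:
        response = get(url, headers=headers)
        print(response.status_code, url)
        if response.status_code == 200:  # keep only pages that loaded properly
            pages.append(response.text)
        time.sleep(delay)  # pause between requests
    return pages
```

In the notebook this might be called as `fetch_pages(page_urls('long-ideas', 210, 212), requests.get, headers={'User-agent': '-----lauren'})`, with each returned string then passed to `BeautifulSoup(html, 'lxml')` as in the cells above.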


---
## Extracting Information for Each Article 
---
### Long Ideas
### _Extracting All Titles_ 
----

In [85]:
def extract_longideas_titles(long_request_list):
    longideas_titles = []   # Creating empty list to append article titles to 
    for i in long_request_list:  # Iterating through scraped results 
        for each in i.find_all('a',{'class':'a-title'}): # Calling proper HTML tags/classes 
            longideas_titles.append(each.text)  # Appending text data to empty list
    return longideas_titles

# Storing the function's result in a variable
fetch_longideas_titles = extract_longideas_titles(long_request_list)

### _Extracting All Links_ 
----

In [86]:
def extract_longideas_links(long_request_list):
    longideas_links = []    # Creating empty list to append article links to 
    for i in long_request_list:  # Iterating through scraped results 
        for each in i.find_all('div',{'class':'media-body'}): # Calling proper HTML tags/classes
            result = each.find('a')
            if result:
                link = result['href']
                longideas_links.append(link) # Appending data to empty list
    return longideas_links

# Storing the function's result in a variable
fetch_longideas_links = extract_longideas_links(long_request_list)

### _Extracting All Authors, Tickers, and Time of Posting_ 
----
- These three features share a single HTML element and are separated only by bullet characters (•). The bullet can therefore be used as a delimiter, and each feature retrieved by its position after splitting, as shown below. Note that some entries begin with an "Editors' Pick" label, which shifts each feature one position to the right.

In [87]:
def extract_longideas_multi(long_request_list):
    longideas_multi = [] # Creating empty list to append requests to
    for i in long_request_list: # Iterating through requests
        for each in i.find_all('div',{'class':'a-info'}): # Calling proper HTML tags/classes
            longideas_multi.append(each.text) # Appending text data to empty list
    return longideas_multi

# Storing the function's result in a variable
fetch_longideas_multi = extract_longideas_multi(long_request_list)

# Text for all 3 features in 1 string, separated by bullets
fetch_longideas_multi[:3]

['CTSH• May 7, 2017, 4:21 PM • Activist Stocks•5\xa0Comments',
 'GIS• May 7, 2017, 3:42 PM • D.M. Martins Research•29\xa0Comments',
 "Editors' Pick • MMM• May 7, 2017, 1:58 PM • Axia Enantio•8\xa0Comments"]

In [88]:
# Transforming results into a DataFrame in order to
# separate the attributes in the strings above

longideas_df = pd.DataFrame({'Multi':fetch_longideas_multi})
longideas_df.head(3)

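# Utilizing bullet as delimiter
# Removing the optional "Editors' Pick" label first so the ticker is always at position 0
# str-split and extract each attribute via indexing position
longideas_parts = longideas_df['Multi'].str.replace(r"^\s*Editors' Pick\s*•\s*", '', regex=True).str.split('•')
longideas_tickers = longideas_parts.str.get(0).str.strip()
longideas_times = longideas_parts.str.get(1).str.strip()
longideas_authors = longideas_parts.str.get(2).str.strip()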
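
The sample strings printed earlier show that some entries carry an "Editors' Pick" label ahead of the ticker. As a minimal sketch for spot-checking individual strings, the plain-Python parser below tolerates that label. The function name `parse_info` and the returned keys are illustrative assumptions, not part of the original notebook.

```python
def parse_info(raw):
    """Split one bullet-delimited info string into ticker, time, and author."""
    # Normalize non-breaking spaces, split on the bullet, and trim each piece
    parts = [piece.strip() for piece in raw.replace('\xa0', ' ').split('•')]
    # Drop the optional leading label so the ticker is always first
    if parts and parts[0] == "Editors' Pick":
        parts = parts[1:]
    # Pad short results so the unpacking below never fails
    parts += [None] * (3 - len(parts))
    ticker, posted, author = parts[:3]
    return {'Tickers': ticker, 'Time': posted, 'Authors': author}

sample = "Editors' Pick • MMM• May 7, 2017, 1:58 PM • Axia Enantio•8\xa0Comments"
print(parse_info(sample))
```

Applied to the third sample string, this yields the ticker `MMM`, the time `May 7, 2017, 1:58 PM`, and the author `Axia Enantio`.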

### Short Ideas
### _Extracting All Titles_ 
----

In [99]:
def extract_shortideas_titles(short_request_list):  
    shortideas_titles = []  # Creating empty list to append article titles to
    for i in short_request_list:  # Iterating through scraped results 
        for each in i.find_all('a',{'class':'a-title'}): # Calling proper HTML tags/classes   
            shortideas_titles.append(each.text) # Appending text data to empty list 
    return shortideas_titles

# Storing the function's result in a variable
fetch_shortideas_titles = extract_shortideas_titles(short_request_list)

### _Extracting All Links_ 
----

In [100]:
def extract_shortideas_links(short_request_list):
    shortideas_links = []   # Creating empty list to append article links to
    for i in short_request_list:  # Iterating through scraped results
        for each in i.find_all('div',{'class':'media-body'}): # Calling proper HTML tags/classes 
            result = each.find('a')
            if result:
                link = result['href']
                shortideas_links.append(link) # Appending data to empty list 
    return shortideas_links

# Storing the function's result in a variable
fetch_shortideas_links = extract_shortideas_links(short_request_list)

### _Extracting All Authors, Tickers, and Time of Posting_
----

In [101]:
def extract_shortideas_multi(short_request_list):
    shortideas_multi = [] # Creating empty list to append requests to
    for i in short_request_list: # Iterating through requests
        for each in i.find_all('div',{'class':'a-info'}): # Calling proper HTML tags/classes
            shortideas_multi.append(each.text) # Appending text data to empty list
    return shortideas_multi

# Storing the function's result in a variable
fetch_shortideas_multi = extract_shortideas_multi(short_request_list)

# Text for all 3 features in 1 string, separated by bullets
fetch_shortideas_multi[:3]

['HPE• Jun. 6, 2017, 1:09 PM • Jay Wei•14\xa0Comments',
 "Editors' Pick • MD• Jun. 6, 2017, 11:20 AM • Michael Boyd•7\xa0Comments",
 'MNK, ESRX• Jun. 6, 2017, 9:36 AM • Citron Research•6\xa0Comments']

In [102]:
# Transforming results into a DataFrame in order to
# separate the attributes in the strings above

shortideas_df = pd.DataFrame({'Multi':fetch_shortideas_multi})
shortideas_df.head(3)

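# Utilizing bullet as delimiter
# Removing the optional "Editors' Pick" label first so the ticker is always at position 0
# str-split and extract each attribute via indexing position

shortideas_parts = shortideas_df['Multi'].str.replace(r"^\s*Editors' Pick\s*•\s*", '', regex=True).str.split('•')
shortideas_tickers = shortideas_parts.str.get(0).str.strip()
shortideas_times = shortideas_parts.str.get(1).str.strip()
shortideas_authors = shortideas_parts.str.get(2).str.strip()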
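
The extracted times are still plain text, while the notebook targets articles posted from 1/1/2017 through 1/31/2018. A minimal sketch of turning one time string into a `datetime` and checking it against that window is given below. The format is inferred only from the sample strings shown above (`May 7, 2017, 4:21 PM` and `Jun. 6, 2017, 1:09 PM`), so other variants may exist on the site. The helper names are hypothetical.

```python
from datetime import datetime

# Date window stated in the notebook summary
STUDY_START = datetime(2017, 1, 1)
STUDY_END = datetime(2018, 1, 31, 23, 59)

def parse_posted_time(text):
    """Parse a string such as 'Jun. 6, 2017, 1:09 PM'; return None if it does not match."""
    cleaned = text.strip().replace('.', '')  # 'Jun.' becomes 'Jun'
    try:
        return datetime.strptime(cleaned, '%b %d, %Y, %I:%M %p')
    except ValueError:
        return None  # unexpected format (assumption: such rows are reviewed by hand)

def in_study_window(posted):
    """True when the parsed time falls inside the notebook's date window."""
    return posted is not None and STUDY_START <= posted <= STUDY_END

print(parse_posted_time(' Jun. 6, 2017, 1:09 PM '))
```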

---
## Transforming Results into DataFrames
---
### Long Ideas

In [89]:
# Function results become columns for dataframe 

longideas = pd.DataFrame({'Time': longideas_times,'Link': fetch_longideas_links,
                          'Title':fetch_longideas_titles,'Authors': longideas_authors,
                          'Tickers': longideas_tickers})
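
Each column here comes from a separate pass over the HTML, so the rows only correspond to the same article if every pass found the same number of items. A minimal sketch of a length check that could run before building the DataFrame follows. The helper name `check_column_lengths` is hypothetical and not part of the original notebook.

```python
def check_column_lengths(columns):
    """Return the length of each column, raising if they are not all equal."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f'Column lengths differ: {lengths}')
    return lengths

# Example with small stand-in lists in place of the scraped results
print(check_column_lengths({'Title': ['a', 'b'], 'Link': ['/x', '/y']}))
```

In the notebook, the dictionary passed to `pd.DataFrame` could be handed to this check first, so a mismatch between titles, links, and info strings surfaces before the columns are combined.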

In [90]:
longideas['Strategy'] = 'Long' # Column added for strategy type 

In [91]:
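# Saving this run's batch of pages; batches from other page ranges are stored as long1.csv through long10.csv
longideas.to_csv('long11.csv')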

In [112]:
# Reading the saved batch files long1.csv through long11.csv
long_batches = [pd.read_csv(f'long{n}.csv') for n in range(1, 12)]

In [114]:
longideas = pd.concat(long_batches)  # Stacking all batches into one DataFrame

In [115]:
longideas.drop('Unnamed: 0',axis=1,inplace=True)

### Short Ideas

In [103]:
# Function results become columns for dataframe 

shortideas = pd.DataFrame({'Time': shortideas_times,'Link': fetch_shortideas_links,
                          'Title': fetch_shortideas_titles,'Authors': shortideas_authors,
                           'Tickers': shortideas_tickers})

In [104]:
shortideas['Strategy'] = 'Short' # Column added for strategy type 

## Compiling "Ideas" DataFrame

In [116]:
ideas = pd.concat([longideas,shortideas])  # Merging long and short ideas

## Saving Results - Exporting "Ideas" DataFrame as CSV 
---

In [117]:
# Exporting Ideas dataframe as csv
ideas.to_csv('IDEAS.csv',index=False)

## Onwards

## Please proceed to Notebook 2 :)  
----
