<img src="seekingalpha.png" style="float: left; margin: 20px; height: 55px">

# Notebook 6- Webscraping Article Content 

_Grabbing All Text From Articles_

---
### Notebook Summary
 
#### 1. Subsetting DataFrame
 - Due to limited computational resources and the amount of text to scrape
     - Almost 3,000 articles
     - Multiple paragraphs in each article

#### 2. Webscraping Article Content
 - `Webscraping` selected `Seeking Alpha` articles for their content
 - Engineered functions to extract all text from each article
----
### _Importing Necessary Libraries_
----

In [1]:
import pandas as pd
import requests
import time
from bs4 import BeautifulSoup

### _Importing "Final" DataFrame_ 
-----

In [2]:
final = pd.read_csv('final.csv')

In [3]:
final.head(3) # Printing first few rows of data 

Unnamed: 0.1,Unnamed: 0,Authors,Link,Title,Strategy,Tickers,Opening Price,Week 1,Week 2,Week 3,...,Week 42,Week 43,Week 44,Week 45,Week 46,Week 47,Week 48,Week 49,Week 50,Week 51
0,0,Paulo Santos,/article/4038275-apple-unexpected-positive-app...,Apple: An Unexpected Positive Appears,Long,AAPL,120.0,120.080002,121.629997,130.289993,...,174.25,173.970001,169.979996,174.089996,169.800003,172.669998,176.419998,170.570007,116.150002,118.989998
1,1,Mark Hibben,/article/4038269-apple-renewing-mac-focus-new-...,Apple: Renewing Mac Focus In The New Year,Long,AAPL,120.0,120.080002,121.629997,130.289993,...,174.25,173.970001,169.979996,174.089996,169.800003,172.669998,176.419998,170.570007,116.150002,118.989998
2,2,Rinse Terpstra,/article/4037413-apple-user-base,Apple: It's All About The User Base,Long,AAPL,120.0,120.080002,121.629997,130.289993,...,174.25,173.970001,169.979996,174.089996,169.800003,172.669998,176.419998,170.570007,116.150002,118.989998


### _Compiling List of Links to Gather Article Content for_ 
-----

In [None]:
links = final['Link']

In [4]:
final['Link'][0]

'/article/4038275-apple-unexpected-positive-appears'

In [12]:
requests.get('https://seekingalpha.com/article/4184106-gulfport-energy-special-situation-conviction-strong-buy')

<Response [200]>

In [None]:
content = []  # Creating empty list to append soup results to 

for link in links:  
    # Each link is a relative article path appended to the site's base URL
    response = requests.get("https://seekingalpha.com" + str(link),
                            headers={'User-agent': 'LaurenCable-GeneralAssemblyCapstone-6787994961-laurencable10@gmail.com'})
    # Introducing header and delay between each request- polite web scraping practices 
    time.sleep(30)     
    print(response.status_code)  
    if response.status_code == 200: # Scraping/parsing HTML only if proper connection made 
        html = response.text         
        soup = BeautifulSoup(html,'lxml')  
        content.append(soup)   # Appending results to the list created above
    print(link)
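
The request loop above can be wrapped in a small reusable helper. This is a minimal sketch, not the code used for the scrape: `fetch_pages`, `BASE_URL`, and the `session` and `sleep` parameters are hypothetical names introduced for illustration. Passing the session in lets the helper be exercised with a stand-in object, and keying results by link keeps each page tied to its article even when some requests fail.

```python
import time
from urllib.parse import urljoin

BASE_URL = "https://seekingalpha.com"  # assumed base for the relative links

def fetch_pages(links, session, headers=None, delay=30, sleep=time.sleep):
    """Request each relative link and return {link: html} for 200 responses.

    `session` is any object with a requests-style .get(url, headers=...) method.
    """
    pages = {}
    for link in links:
        url = urljoin(BASE_URL, str(link))
        response = session.get(url, headers=headers)
        sleep(delay)  # pause between requests (polite scraping)
        if response.status_code == 200:
            pages[link] = response.text  # keep raw HTML, keyed by its link
    return pages
```

With the real site, `session` could be a `requests.Session()` and `headers` the same `User-agent` dictionary used in the cell above.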

In [None]:
# Separating dataframe by strategy type
long = final[final['Strategy']=='Long']
short = final[final['Strategy']=='Short']

In [None]:
len(long)

In [None]:
len(short)

In [None]:
# Separating dataframe by strategy type
long = final[final['Strategy']=='Long']
short = final[final['Strategy']=='Short']

# Unique tickers only in each strategy type dataframe
long_tickers = long.drop_duplicates('Tickers')
short_tickers = short.drop_duplicates('Tickers')

# Compiling lists of links (one article per unique ticker)
long_links = long_tickers['Link'].tolist()
short_links = short_tickers['Link'].tolist()

# Combining lists
language_links = long_links + short_links

# Subsetting final dataframe on list
language_dataframe = final[final['Link'].isin(language_links)]
language_dataframe.head(3)
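
To make the effect of the subsetting concrete, here is a small sketch on invented data (the `toy` frame and its links are made up for illustration). It uses a single `drop_duplicates` call on both columns, which should give the same rows as the split, de-duplicate, and `isin` steps above, assuming every link is unique and `Strategy` only takes the values `Long` and `Short`.

```python
import pandas as pd

# Invented example rows: two tickers, both strategies
toy = pd.DataFrame({
    'Link': ['/a1', '/a2', '/t1', '/t2', '/a3'],
    'Strategy': ['Long', 'Long', 'Long', 'Short', 'Short'],
    'Tickers': ['AAPL', 'AAPL', 'TSLA', 'TSLA', 'AAPL'],
})

# Keep the first article for each (strategy, ticker) pair
subset = toy.drop_duplicates(['Strategy', 'Tickers'])
print(subset['Link'].tolist())  # ['/a1', '/t1', '/t2', '/a3']
```

The second `AAPL` long article (`/a2`) is dropped, while `AAPL` still appears once under each strategy.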

In [None]:
language_dataframe['Link'][0]

---
## Constructing Webscrapers 
---
### _Links_

In [None]:
short_request_list = []  # Creating empty list to append soup results to 

for i in range(5,28):  
    # Range (x,x) indicates pages being scraped
    # Pages selected to scrape selected dates 
    response = requests.get("https://seekingalpha.com/stock-ideas/short-ideas?page="+str(i),
    headers = {'User-agent': 'LaurenCable-GeneralAssemblyCapstone-6787994961-laurencable10@gmail.com'})
        # 'str(i)' indicates page number being scraped  
        #  Introducing header and delay between each request- polite web scraping practices 
    time.sleep(30)     
    print(response.status_code)  
    if response.status_code == 200: # Scraping/parsing HTML only if proper connection made 
        html = response.text         
        soup = BeautifulSoup(html,'lxml')  
        short_request_list.append(soup)   # Appending results to empty list
    print(f'(https://seekingalpha.com/stock-ideas/short-ideas?page={str(i)})')
    print(i)

In [None]:
content = []

    
response = requests.get('https://seekingalpha.com/article/4037958-tesla-believe-morgan-stanley',
           headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'})
html = response.text
soup = BeautifulSoup(html,'lxml')
content.append(soup)


In [None]:
request_list_9 = []
for link in links[80:90]:  
    # Slice [80:90] indicates the batch of links being scraped
    response = requests.get('https://seekingalpha.com' + link, 
                            headers={'User-agent': 'cablegirl-dsi-ga'}) 
    # Header- way for website owners to contact you   
    time.sleep(5)  # 5 second delay between each request  
    print(response.status_code) 
    if response.status_code == 200: # Scrape/parse HTML only if proper connection made 
        html = response.text    
        soup = BeautifulSoup(html,'lxml') 
        request_list_9.append(soup)   # Append results to empty list
    print(link)
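
The batch cell above appears to have been rerun once per slice of `links`. As a sketch of how the slice boundaries could be generated rather than typed by hand, assuming batches of ten as in `links[80:90]`, the hypothetical helper `batch_slices` yields each `(start, stop)` pair:

```python
def batch_slices(n_items, batch_size=10):
    """Yield (start, stop) index pairs that cover n_items in order."""
    for start in range(0, n_items, batch_size):
        yield start, min(start + batch_size, n_items)

# Example with 25 items: the final batch is shorter than the rest
print(list(batch_slices(25)))  # [(0, 10), (10, 20), (20, 25)]
```

Each pair could then be used as `links[start:stop]`, with one output file per batch.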

#### Extracting Text & Transforming into DataFrame

In [None]:
# Creating function to extract article text
def extract_article_text(request_list): 
    articles = []   # Empty list to append article text to 
    for soup in request_list:  # Iterating through scraped results 
        for body in soup.find_all('div',{'id':'a-body'}): # Selecting the article body by tag and id 
            articles.append(body.text)  # Appending text data to list
    return articles

# Storing function output as variable (rerun for each batch of scraped pages)
fetch_articles = extract_article_text(request_list_14)

# Creating dataframes with fetched results as column (total- 30 dataframes)
text = pd.DataFrame(fetch_articles)

# Exporting as csv
text.to_csv('articles14.csv')
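
The extraction step can be tried on a small invented page before running it on real responses. In this sketch the HTML string and the helper name `article_text` are made up for illustration; only the `a-body` id comes from the function above. It uses `get_text` with a separator so that adjacent paragraphs do not run together, and the built-in `html.parser` so that `lxml` is not required.

```python
from bs4 import BeautifulSoup

def article_text(soup):
    """Return the text inside <div id="a-body">, or None if it is missing."""
    body = soup.find('div', {'id': 'a-body'})
    if body is None:
        return None
    # Join the text fragments with single spaces and trim each one
    return body.get_text(separator=' ', strip=True)

# Invented page: one article body plus unrelated markup
html = ('<html><body><div id="a-body"><p>First paragraph.</p>'
        '<p>Second paragraph.</p></div><div id="sidebar">Ignore me</div>'
        '</body></html>')
soup = BeautifulSoup(html, 'html.parser')
print(article_text(soup))  # First paragraph. Second paragraph.
```

Returning `None` for pages without an article body makes missing content visible instead of silently producing fewer rows than links.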

## Compiling Final DataFrame
---
#### Appending Article Content to Language DataFrame
---

In [None]:
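article_content = pd.read_csv('article_content.csv')  # Reading article content csv 

# Adding article text as a new column; .values assigns by position, which
# assumes article_content has one row per language_dataframe row, in the same order
language_dataframe = language_dataframe.copy()
language_dataframe['Article Content'] = article_content['0'].values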

In [None]:
language_dataframe.head(2)  # Inspecting progress
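
Matching scraped text to rows by position only works if every request succeeded and the order never changed. A more defensive option, sketched here on invented data, is to keep each article's link next to its text and join on `Link`. The frames `frame` and `scraped` are hypothetical stand-ins, not the notebook's actual files.

```python
import pandas as pd

# Stand-in for the subset dataframe, with a non-contiguous index
frame = pd.DataFrame(
    {'Link': ['/article/a', '/article/b', '/article/c'],
     'Strategy': ['Long', 'Long', 'Short']},
    index=[0, 5, 9],
)

# Stand-in for scraped results; '/article/b' failed and is absent
scraped = pd.DataFrame({
    'Link': ['/article/a', '/article/c'],
    'Article Content': ['text a', 'text c'],
})

# Left join keeps every row of frame; missing articles become NaN
merged = frame.merge(scraped, on='Link', how='left')
print(merged)
```

Rows whose article could not be scraped then show up as missing values rather than shifting the text of later rows.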

---
## Saving Results 

---

#### Exporting as CSV  
- Save `Language` `DataFrame` as csv file  
- Read csv back in to ensure export was successful
- Will undergo `Natural` `Language` `Processing` and `Topic` `Modelling` in subsequent notebooks.

In [None]:
language_dataframe.to_csv('language_dataframe.csv')
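
The read-back check listed above could look like the following sketch, which uses an invented two-row frame and a temporary file rather than the real `language_dataframe.csv`. Passing `index=False` is an optional choice here: it keeps the index out of the file, which avoids the extra `Unnamed: 0` columns that appear when a saved index is read back in as data.

```python
import os
import tempfile
import pandas as pd

# Invented stand-in for the dataframe being exported
frame = pd.DataFrame({
    'Link': ['/article/a', '/article/b'],
    'Article Content': ['text a', 'text b'],
})

path = os.path.join(tempfile.mkdtemp(), 'language_dataframe.csv')
frame.to_csv(path, index=False)   # write without the index column

check = pd.read_csv(path)         # read back to confirm the export
print(check.shape == frame.shape)
print(list(check.columns))
```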