# Data Collection 

## Web Scraping

We collect our data by web scraping.

The source is [Scraps from the Loft : Stand-Up Comedy Transcripts](https://scrapsfromtheloft.com/stand-up-comedy-scripts/).

The packages used are *requests*, to download the pages, and *BeautifulSoup*, to parse the HTML.

In [1]:
import requests
from bs4 import BeautifulSoup

In [3]:
URL = 'https://scrapsfromtheloft.com/stand-up-comedy-scripts/'
page = requests.get(URL)

soup = BeautifulSoup(page.content, "html.parser")
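
The scraping loop further down issues one request per transcript, so a small helper that sets a timeout, checks the HTTP status, and pauses between requests may be worth having. The following is a minimal sketch and not part of the original notebook: the name `fetch_soup`, its `session`, `timeout` and `delay` parameters, and the default values are all hypothetical choices.

```python
import time

import requests
from bs4 import BeautifulSoup


def fetch_soup(url, session=None, timeout=10, delay=1.0):
    """Download `url` and return the parsed HTML (hypothetical helper)."""
    if session is None:
        session = requests.Session()
    response = session.get(url, timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    time.sleep(delay)            # assumed pause so the site is not hit in a tight loop
    return BeautifulSoup(response.content, "html.parser")
```

Under these assumptions, `soup = fetch_soup(URL)` would stand in for the `requests.get` and `BeautifulSoup` calls above, and passing one shared `requests.Session` lets all requests reuse the same connection.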

In [78]:
comedian_col = []
date_col = []
title_col = []
subtitle_col = []
transcript_col = []

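# Each transcript link sits in an <h3 class="elementor-post__title"> heading;
# the first four matches are skipped.
transcripts = soup.find_all('h3', class_='elementor-post__title')[4:]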

for t in transcripts : 
    href = t.find('a').get('href')
    t_page = requests.get(href)
    t_soup = BeautifulSoup(t_page.content, "html.parser")
    
    comedian_and_title = t_soup.find('h1', class_ = 'elementor-heading-title').text
    comedian_and_title_list = comedian_and_title.split(': ')
    
    if len(comedian_and_title_list) == 2 : 
        comedian = comedian_and_title_list[0].strip()
        title = comedian_and_title_list[1].strip()
        
    elif len(comedian_and_title_list) == 1 : 
        comedian = 'NA'
        title = comedian_and_title_list[0].strip()
    
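    else:
        # The heading contains more than one ': ' (as in the Dave Chappelle
        # 'The Punchline' page): the first part is the comedian and the rest,
        # rejoined, is the title. Without this fallback, comedian and title
        # would silently keep their values from the previous iteration.
        comedian = comedian_and_title_list[0].strip()
        title = ': '.join(comedian_and_title_list[1:]).strip()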
        
    date = t_soup.find('span', class_ = 'elementor-post-info__item--type-date').text.strip()
    transcript = t_soup.find("div", class_="elementor-widget-theme-post-content").text.strip()
    
    try:
        subtitle = t_soup.find("div", class_="elementor-widget-theme-post-excerpt").text.strip()
    except AttributeError:
        subtitle = 'NA'
    
    # append to columns 
    comedian_col.append(comedian)
    date_col.append(date)
    title_col.append(title)
    subtitle_col.append(subtitle)
    transcript_col.append(transcript)
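
The branching on the number of `': '` separators in the page heading can also be written as a split at the first separator only. The sketch below is an alternative, not the notebook's own code: `split_heading` is a hypothetical helper, and it assumes that everything before the first `': '` is the comedian's name.

```python
def split_heading(heading):
    """Split 'Comedian: Title' at the first ': ' (hypothetical helper)."""
    comedian, separator, title = heading.partition(': ')
    if not separator:
        # No ': ' in the heading, so there is no comedian prefix.
        return 'NA', heading.strip()
    # Any further ': ' stays inside the title.
    return comedian.strip(), title.strip()


print(split_heading('Chris Rock: Selective Outrage (2023) | Transcript'))
```

Because `str.partition` splits only once, headings with two or more separators need no special case.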
    

In [79]:
import pandas as pd

df_dic = {
    'Comedian': comedian_col,
    'Date': date_col, 
    'Title': title_col, 
    'Subtitle': subtitle_col,
    'Transcript': transcript_col
}

df = pd.DataFrame(df_dic)
df.head()

| Unnamed: 0 | Comedian | Date | Title | Subtitle | Transcript |
|---|---|---|---|---|---|
| 0 | Chris Rock | March 8, 2023 | Selective Outrage (2023) \| Transcript | | [slow instrumental music playing] [funk drums ... |
| 1 | Marc Maron | March 3, 2023 | Thinky Pain (2013) \| Transcript | Marc Maron returns to his old stomping grounds... | [siren wailing] I don’t know what you were thi... |
| 2 | Chelsea Handler | March 3, 2023 | Evolution (2020) \| Transcript | Chelsea Handler is back and better than ever -... | Join me in welcoming the author of six number ... |
| 3 | Tom Papa | March 3, 2023 | What A Day! (2022) \| Transcript | Follows Papa as he shares about parenting, his... | Premiered on December 13, 2022 Ladies and gent... |
| 4 | Jim Jefferies | February 22, 2023 | High n’ Dry (2023) \| Transcript | Jim Jefferies is back and no topic is off limi... | Please welcome to the stage, Jim Jefferies! He... |
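
The `Date` column holds plain strings such as `March 8, 2023`. If real dates are needed later, for sorting or filtering, they can be parsed with the standard library. This is a sketch only: `parse_post_date` is a hypothetical helper, and the `%B` directive assumes English month names, that is, an English or C locale.

```python
from datetime import datetime


def parse_post_date(text):
    """Turn a string like 'March 8, 2023' into a date (hypothetical helper)."""
    return datetime.strptime(text.strip(), '%B %d, %Y').date()


print(parse_post_date('March 8, 2023'))  # 2023-03-08
```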


In [80]:
# output as csv 
df.to_csv('transcripts.csv')
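
Two details matter when this file is read back. `to_csv` writes the row index by default, so a plain `pd.read_csv` returns it as an extra `Unnamed: 0` column, and `read_csv` treats the string `NA` as a missing value by default, so the `'NA'` placeholders come back as `NaN`. The sketch below shows both effects on a small stand-in frame written to an in-memory buffer; `demo` and the other variable names are illustrative only.

```python
import io

import pandas as pd

# Small stand-in for the scraped data.
demo = pd.DataFrame({'Comedian': ['Chris Rock'], 'Subtitle': ['NA']})

buffer = io.StringIO()
demo.to_csv(buffer)  # same call as above, but into memory instead of a file
csv_text = buffer.getvalue()

# Default read: the saved index becomes a column named 'Unnamed: 0'.
with_extra_column = pd.read_csv(io.StringIO(csv_text))

# index_col=0 restores the index; 'NA' is still parsed as NaN.
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

# keep_default_na=False keeps 'NA' as a literal string.
as_text = pd.read_csv(io.StringIO(csv_text), index_col=0, keep_default_na=False)

print(list(with_extra_column.columns))
print(with_index['Subtitle'].isna().all(), as_text['Subtitle'].tolist())
```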

In [81]:
# pickle file for later use 
import os
import pickle

os.makedirs('pickle', exist_ok=True)  # open() fails if the directory is missing
with open('pickle/transcript.pkl', 'wb') as f:
    pickle.dump(df, f)
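
A later notebook can restore the DataFrame from the pickle file. The sketch below round-trips a small stand-in frame through a temporary directory so that it runs on its own; `demo`, `path` and `restored` are illustrative names, and in practice the path would be `pickle/transcript.pkl`.

```python
import os
import pickle
import tempfile

import pandas as pd

# Small stand-in for the scraped data.
demo = pd.DataFrame({'Comedian': ['Chris Rock'], 'Date': ['March 8, 2023']})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'transcript.pkl')
    with open(path, 'wb') as f:
        pickle.dump(demo, f)
    with open(path, 'rb') as f:
        restored = pickle.load(f)

print(restored.equals(demo))  # True
```

`pd.read_pickle(path)` is an equivalent one-line way to load the file. Pickle files should only be loaded from trusted sources, because unpickling can execute arbitrary code.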