## Project: Web Scraping
- Name: Levi Grenier
- Date: Sep. 29, 2022

## Instructions

### Description

Beautiful Soup and Selenium cover much of the same ground when it comes to web scraping. In this assignment you will use both to scrape the website below so that you can see how similar they are in functionality, and how Selenium offers a few more advanced options. The website has many movies and TV shows to pick from.

https://subslikescript.com/


### Grading

For grading purposes, we will clear all outputs from all your cells and then run them all from the top.  Please test your notebook in the same fashion before turning it in.

### Submitting Your Solution

To submit your notebook, first clear all cell outputs. Then use File->Download As->Notebook to obtain the notebook file. You will submit a .zip file containing this notebook, the .txt file from the Beautiful Soup portion, and the two .txt files from the Selenium portion. Finally, submit the .zip file on Canvas.
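
The bundling step can also be scripted. The following is a minimal sketch using Python's standard `zipfile` module; the helper `bundle_submission` and every file name in it are hypothetical placeholders, not names required by the assignment.

```python
import tempfile
import zipfile
from pathlib import Path


def bundle_submission(zip_path, files):
    """Write each file in `files` into a flat .zip archive."""
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for file in map(Path, files):
            # arcname drops the directory part so the archive stays flat
            archive.write(file, arcname=file.name)
    return Path(zip_path)


# Demonstration with placeholder files in a temporary directory
names = [
    "project.ipynb",
    "Beautiful Soup Example.txt",
    "Selenium Example.txt",
    "Selenium2 Example.txt",
]
with tempfile.TemporaryDirectory() as tmp_dir:
    tmp = Path(tmp_dir)
    for name in names:
        (tmp / name).write_text("placeholder", encoding="utf-8")
    archive_path = bundle_submission(tmp / "submission.zip", [tmp / n for n in names])
    with zipfile.ZipFile(archive_path) as archive:
        bundled_names = sorted(archive.namelist())

print(bundled_names)
```

In a real submission the list would hold the actual notebook and the three generated .txt files instead of the placeholders.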

### Part 1: Beautiful Soup

### Setup

In [None]:
from bs4 import BeautifulSoup
import requests # sends requests to a website
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

### Problem 1: Setup Beautiful Soup (5 points)
Complete the following tasks:

1. Select a movie or TV show on the website that you want to scrape.
2. Send a request to the website and then call Beautiful Soup on that request.
3. Print the result of implementing Beautiful Soup to check your work.

In [None]:
website = 'https://subslikescript.com/movie/Kaguya-sama_Love_Is_War-9816396'
result = requests.get(website)
context = result.text
soup = BeautifulSoup(context, 'lxml')

### Problem 2: Website Exploration (10 points)

a. Provide code below to produce answers to the following questions (edit this cell with your answers): 

    1. What is the title of the movie or TV show that you chose?
    
    Kaguya Sama: Love is War (2019) - full transcript

    2. How do you know that this is the title of the article? (think the tag)

    We used the h1 tag. This is the first header tag, so it is typically the article's title. 

b. Use Beautiful Soup to print the text of the title.

c. Use Beautiful Soup to print the basic HTML code of the chosen movie or TV show.     

d. Use Beautiful Soup to print the text of the full script of the movie or TV show.

In [None]:
# a
title = soup.find('h1').get_text()


In [None]:
# b
print(title)

In [None]:
#c. 
print(soup.prettify())

In [None]:
# d. 
full_script_text = soup.find('div', class_="full-script").get_text()
print(full_script_text)

### Problem 3: Create the txt File

Now send the entire script of the movie or TV show to a txt file. 

With the name of the txt file as **Beautiful Soup** with the title of the chosen article. 

For example: If I chose Titanic the name of the txt file would be:

    1. 'Beautiful Soup Titanic (1997) - full transcript.txt'

In [None]:
title = title.replace(':', '', 1)  # ':' is not allowed in Windows file names
with open(f'Beautiful Soup {title}.txt', 'w', encoding="utf-8") as file:
    file.write(full_script_text)  # script text extracted in Problem 2d
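
The colon is removed above because Windows does not allow it in file names. Other titles may contain other forbidden characters, so a slightly more general approach is sketched below. The helper `safe_filename` is a hypothetical name introduced for illustration; it is not part of the assignment or of Beautiful Soup.

```python
import re


def safe_filename(title, prefix="Beautiful Soup"):
    """Build '<prefix> <title>.txt' with characters Windows forbids removed."""
    # Windows rejects these characters in file names: < > : " / \ | ? *
    cleaned = re.sub(r'[<>:"/\\|?*]', "", title)
    # Collapse any whitespace runs left behind by the removals
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    return f"{prefix} {cleaned}.txt"


example = safe_filename("Kaguya-sama: Love Is War (2019) - full transcript")
print(example)
```

The same helper could be reused for the Selenium files by passing `prefix="Selenium"` or `prefix="Selenium2"`.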

### Reflection

Now assume at this point that you wanted to look at the script of a different movie. If you were just using Beautiful Soup, would it be possible to choose a different movie by **only using code**? How would you have to choose a different movie if Beautiful Soup was your web scraping method of choice?

**Discuss new results**
> I don't think so. It doesn't even look like there's a suitable URL in the HTML code that I could parse out and then use. I suppose I could not get there only using code and the given webpage. 
>

### Part 2: Selenium

### Setup

In [None]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select
import time
import pandas as pd
from pandas import DataFrame

### Problem 4: Setup Selenium
We are going to use the **same** movie or TV show as we used above. 

Complete the following tasks:

1. Define the website
2. Define the path to the driver.
3. Create the driver
4. Use the driver to get the website. 

In [None]:
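from selenium.webdriver.chrome.service import Service

website = 'https://subslikescript.com/movie/Kaguya-sama_Love_Is_War-9816396'
path = '/Users/levig/Documents/Mines F22/CSCI 303/chromedriver.exe'
# Selenium 4 expects the driver path wrapped in a Service object
driver = webdriver.Chrome(service=Service(path))
driver.get(website)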

### Problem 5: Website Exploration With Selenium

**Note that when you use the driver to find_elements it returns a list. You cannot use .text on a list so you have to access the element inside of that resulting list to use .text and print the title.**
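
The list behavior described in the note can be sketched without launching a browser. In the block below, `FakeDriver` and `FakeElement` are hypothetical stand-ins that only mimic the `find_elements` and `.text` interface; with a real Selenium driver you would pass `driver` and `By.TAG_NAME` instead. The helper `first_text` is also an illustrative name, not a Selenium function.

```python
class FakeElement:
    """Stand-in for a Selenium WebElement; only exposes `.text`."""

    def __init__(self, text):
        self.text = text


class FakeDriver:
    """Stand-in for a WebDriver so the sketch runs without a browser."""

    def __init__(self, elements_by_tag):
        self._elements_by_tag = elements_by_tag

    def find_elements(self, by, value):
        # Like Selenium, always return a list (possibly empty)
        return [FakeElement(t) for t in self._elements_by_tag.get(value, [])]


def first_text(driver, by, value):
    """Return the text of the first match, or None when nothing matches."""
    matches = driver.find_elements(by, value)  # a list, so no `.text` here
    return matches[0].text if matches else None


fake = FakeDriver({"h1": ["Example Title - full transcript"]})
# "tag name" is the string value behind Selenium's By.TAG_NAME
title_text = first_text(fake, "tag name", "h1")
missing_text = first_text(fake, "tag name", "h2")
print(title_text, missing_text)
```

Indexing the list only after checking that it is non-empty avoids an `IndexError` when the page has no matching element.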

a. Provide code below to produce answers to the following questions (edit this cell with your answers): 

    1. What is the title of the movie or TV show that you chose?

    KAGUYA-SAMA: LOVE IS WAR (2019) - FULL TRANSCRIPT
    
    2. What did you use to find_element by? i.e. XPath, ID, TAG_NAME?
    
    TAG_NAME
    
    3. If you used XPath, what is the XPath of the title? 
       If you used ID, what is the ID of the title?
       If you used the TAG_NAME, what is the TAG_NAME of the title?
       
       TAG_NAME: h1

b. Use Selenium to print the text of the title. 
        
c. Use Selenium to print the text of the main article of the chosen movie or TV show.

    1. What did you use to find_element by? i.e. XPath, ID, TAG_NAME?
    
    I used XPATH.
    
    2. If you used XPath, what is the XPath? 
       If you used ID, what is the ID?
       If you used the TAG_NAME, what is the TAG_NAME?

    XPATH: '//article[@class=\'main-article\']'
    
    It took me a bit to realize that I needed the escape character before the other quotation marks.

d. Use Selenium to print the text of the full script of the movie or TV show.

    1. What did you use to find_element by? i.e. XPath, ID, TAG_NAME?
    
    2. If you used XPath, what is the XPath? 
       If you used ID, what is the ID?
       If you used the TAG_NAME, what is the TAG_NAME?


In [None]:
# a.
title2 = driver.find_element(By.TAG_NAME, 'h1').text

In [None]:
# b.
print(title2)

In [None]:
# c.
article2 = driver.find_element(By.XPATH, '//article[@class=\'main-article\']') # I know it's the first time I'm using "article", but I just want it to be consistent.
print(article2.text)

In [None]:
# d.
full_script_text2 = driver.find_element(By.XPATH, '//div[@class=\'full-script\']').text
print(full_script_text2)

### Problem 6: Create the txt File

Now send the entire script of the movie or TV show to a txt file. 

With the name of the txt file as **Selenium** with the title of the chosen article. 

For example: If I chose Titanic the name of the txt file would be:

    1. 'Selenium Titanic (1997) - full transcript.txt'

In [None]:
title2 = title2.replace(':', '')  # ':' is not allowed in Windows file names
with open(f'Selenium {title2}.txt', 'w', encoding="utf-8") as file:
    file.write(full_script_text2)  # script text extracted in Problem 5d

### Reflection

At this point you have done the exact same thing with Selenium that you have done with Beautiful Soup. Take a few minutes and reflect on the following questions: 

    1. At this stage in the assignment has Selenium or Beautiful Soup been easier to use? Why has this method been easier to use?
    2. If you were to implement web scraping in the future would you prefer to use Beautiful Soup or Selenium?
    3. What aspects of Beautiful Soup did you prefer over Selenium? What aspects of Selenium did you prefer over Beautiful Soup?


**Put your answers here**(Edit this cell)
> 1. Selenium has not been easier to use. It took me a lot longer to understand the XPATH, ID, and TAG_NAME locators.
> 2. Probably Beautiful Soup, but it depends on the context. Selenium is powerful.
> 3. I liked how unpolished Beautiful Soup was and how it didn't try to do things for me. I like knowing exactly what's happening to my data. With Selenium, I'm not sure how it is handling formatting and other things. Beautiful Soup means fewer variables.

### Problem 7: Choosing a Different Movie/TV Show

By now I am sure you are tired of the same movie!

I'm curious whether any movies start with the letter 'X'.

Can you help me out and look at movies that start with the letter X?

**hint: you will have to push 3 different buttons for this**

**ALSO THIS PAGE HAS POP UP ADS. THE WEBSITE FOLLOWS ALONG IN LIVE TIME AS YOU CLICK ON THESE LINKS SO IF A POP UP AD APPEARS JUST CLICK 'CLOSE' ON THE POP UP AD AND KEEP SELENIUM'ING AWAY**

In [None]:
# click on the movie link to go back
movie_button = driver.find_element(By.XPATH, '//a[@href=\'/movies\']')
movie_button.click()

In [None]:
# click on the 'X' link to look at movies that start with the letter X
x_button = driver.find_element(By.XPATH, '//a[@href=\'/movies_letter-X\']')
x_button.click()

In [None]:
# now click on your movie
x_men_button = driver.find_element(By.XPATH, '/html/body/div/div/main/article/ul/a[15]') #I found an option in Inspect that said "copy XPATH". Looks like it works!
x_men_button.click()

### Problem 8: Create the txt File and Supporting Information

Now that the new movie/TV show is loaded up we are almost done!

Do these next 4 things:

    1. Get the title of the movie/TV show.
    
    2. Get the full-script of the movie/TV show.
    
    3. Create a txt file of the full-script with the naming scheme for the txt as 'Selenium2' and then the title then .txt. So say I choose Titanic the name of my txt to submit would be: 'Selenium2 Titanic (1997) - full transcript.txt'.
    
    4. Lastly, remember to tell the driver to quit as we are done scraping!

In [None]:
# get the title and then put it into text form
title3 = driver.find_element(By.TAG_NAME, 'h1').text
print(title3)

In [None]:
# get the full-script and then put it into text form
full_script_text3 = driver.find_element(By.XPATH, '//div[@class=\'full-script\']').text
print(full_script_text3)

In [None]:
# create the txt file
title3 = title3.replace(':',"")
with open(f'Selenium2 {title3}.txt', 'w', encoding="utf-8") as file:
    file.write(full_script_text3)

In [None]:
# we are done scraping so quit the driver.
driver.quit()

### Part 3: Data Visualization With Web Scraping

Now what we want to do is utilize our web scraping skills to implement data manipulation techniques that we have learned earlier in this course. 

Your task for this last part is to load in the **soccer_data.csv** file that we created at the end of the **Selenium WebScraper** lesson and run any data manipulation techniques on that data that you would like. You can put it into a DataFrame, run a classification model on it, show visualizations, etc. Anything that we have learned in this course up to this point is fair game.

### Problem 9: Data Manipulation on Preprocessed Data

In [None]:
# read in the csv (make sure the csv is in the same directory as this project)
soccer_data = pd.read_csv('soccer_data.csv')

In [None]:
# do your desired data manipulation technique!
import numpy as np

home_score = []
away_score = []
total_score = []
team_wins = dict()
team_goals = dict()
for i in range(len(soccer_data)):
    # Scores: split the score string on " - " into home and away goals
    scores = soccer_data.iloc[i,2].split(" - ")
    home_goals = int(scores[0])
    away_goals = int(scores[1])
    home_score.append(home_goals)
    away_score.append(away_goals)
    total_score.append(home_goals + away_goals)

    home = soccer_data.iloc[i,1]
    away = soccer_data.iloc[i,3]

    # Team goals (tallied before the winner check so draws are counted too)
    team_goals[home] = team_goals.get(home, 0) + home_goals
    team_goals[away] = team_goals.get(away, 0) + away_goals

    # Winner count (a draw adds no win to either team)
    if home_goals > away_goals:
        winner = home
    elif home_goals < away_goals:
        winner = away
    else:
        continue
    team_wins[winner] = team_wins.get(winner, 0) + 1
        
# Scatterplot of score combinations
plt.scatter(home_score, away_score)
plt.title("Score Combinations")
plt.xlabel("Home Score")
plt.ylabel("Away Score")
plt.show()

# Histogram of total goals per match
plt.hist(total_score, bins = max(total_score))
plt.title("Score Distribution")
plt.xlabel("Total Score")
plt.ylabel("Frequency")
plt.show()
print(f"Average score: {sum(total_score)/len(total_score)}")

# Collect wins and goals for each team with at least one win
wins = []
goals = []
for k,v in team_wins.items():
    wins.append(v)
    goals.append(team_goals[k])

# Relationship between total goals and total wins
plt.scatter(wins, goals)
plt.title("Total Goals vs Wins for Individual Teams")
plt.xlabel("Total Wins")
plt.ylabel("Total Goals")
plt.show()
correlation = np.corrcoef(wins, goals)
print(f"Correlation between total goals and total wins: {correlation[0][1]}")
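
The tallying logic above can be written more compactly with `collections.Counter`. The sketch below uses a few hypothetical sample rows in `(home, score, away)` form rather than the real **soccer_data.csv**, so the team names and scores are made up for illustration. Goals are tallied before the result is checked, so drawn matches still contribute to each team's goal total.

```python
from collections import Counter

# Hypothetical sample rows in (home, score, away) form
matches = [
    ("Team A", "2 - 1", "Team B"),
    ("Team B", "0 - 0", "Team C"),
    ("Team C", "1 - 3", "Team A"),
]

team_wins = Counter()
team_goals = Counter()

for home, score, away in matches:
    home_goals, away_goals = (int(part) for part in score.split(" - "))
    # Goals count for both teams regardless of the result
    team_goals[home] += home_goals
    team_goals[away] += away_goals
    # Only a decisive result adds a win; draws add none
    if home_goals > away_goals:
        team_wins[home] += 1
    elif away_goals > home_goals:
        team_wins[away] += 1

print(dict(team_goals))
print(dict(team_wins))
```

A `Counter` returns 0 for a missing key, which removes the need for the `get(...) is None` checks.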

### Reflection

At this point you have now done some sort of data manipulation with data that you scraped from the web! This is important because most data that you will work with will not be given to you. You have to go out and collect the data in order to work with it, which is exactly what we did here. Answer the following reflection questions regarding Part 3:
   
   1. What data manipulation technique did you do on the data?
   2. What did the results of your data manipulation tell you about the data? Any hidden meanings or things of value that you found in the data?
   
**Put your answers here**(Edit this cell)
    
>    I made visuals showing the score combinations, the score distribution, and the total goals vs. total wins for each of the teams.

    
>    It told me that total goals and total wins are highly correlated. It told me that the average number of goals in a game was 2.82. It also showed me a "limiting distance", so to speak, in the score combinations plot.
    However, this analysis didn't tell me much because it was very simple. I would like to do a cluster analysis that uses the difference between the winner's and loser's scores and the date as features. The question this would be looking to answer is whether there are teams that have similar patterns of dominance over time.
    (Sorry I didn't play with this more. I need to catch a flight tomorrow morning.)


## You Finished! Treat yourself by taking this questionnaire
### Questionnaire
1) How long did you spend on this assignment?
<br>Like three hours.<br>
2) What did you like about it? What did you not like about it?
<br>I'm in a rush, so that's tainting my perception of it. That being said, I liked the concept quite a lot. I especially liked how you left a sort of free-response exploratory question at the end. I would like to see that in other projects (but it might be more effective to place that question as its own mini-assignment, as having it at the end of a difficult assignment like this will probably make people explore less).<br>
3) Did you find any errors or is there anything you would like changed?
<br>You said that there wasn't data on the website for the 22/23 season in the Selenium tutorial, but there is now. <br>