# FashionFootprint Week 4 - QQQ Report

## **Q1**  How can we find fashion-related YouTube videos? *(Salley)*

### Qualitative:
#### Problem - 
- How can we use the YouTube API to search through YouTube videos and find fashion-related ones whose links we can use for web scraping?
- Will we be able to get enough data from YouTube?  
#### Hypothesis & Claim - 
- We should be able to search through fashion data by matching titles to a set of fashion-related keywords
- Then extract links from the video descriptions  
#### Context, Motivation & Rationale - 
- We want the Chrome Extension to work on YouTube, so seeing if we can access links and scrape them to provide feedback is an important first step.  
- Having a dataset of existing YouTube videos may also help us test our tool on a smaller scale while in the development stages
#### Rationale, Assumptions, Biases - 
- Assuming all YouTube data gathered is accurate and reliable
- My rationale in selecting my keywords is my own knowledge of how fashion-related YouTube videos are titled
- I may be biased towards certain keywords/titles due to my watching certain types of fashion content
#### Definitions, Data, and Methods - 
- Using the YouTube Data API  
- Using methods similar to our YouTube lab from Weeks 1 & 2 (requesting the snippet part of each video's data)

### Quantitative:

In [None]:
from googleapiclient.discovery import build
import pandas as pd
from datetime import datetime
import re

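# set up YouTube Data API
# (the key is read from an environment variable so it is not stored in the notebook;
#  YOUTUBE_API_KEY is an assumed variable name)
import os
api_key = os.environ["YOUTUBE_API_KEY"]
youtube = build('youtube', 'v3', developerKey=api_key)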

In [None]:
# search by keywords
keywords = ['haul', 'clothing', 'clothes', 'shop', 'shopping', 'try on', 'try-on']

# set time period to cover all of 2023 (publishedBefore is exclusive, so use Jan 1, 2024)
published_after = datetime(2023, 1, 1).isoformat() + 'Z'
published_before = datetime(2024, 1, 1).isoformat() + 'Z'

In [None]:
# search for videos
search_response = youtube.search().list(
    q='|'.join(keywords), # q expects a single string; '|' acts as OR between keywords
    part='snippet',
    type='video',
    publishedAfter=published_after,
    publishedBefore=published_before,
    maxResults=1 # returns 1 result (video) that matches the criteria
).execute()

In [None]:
# process results
videos = []
for search_result in search_response.get('items', []):
    # get and store video id
    video_id = search_result['id']['videoId']
    video_response = youtube.videos().list(
        # receive snippet part of data - title, description, tags, etc.
        part="snippet",
        id=video_id
    ).execute()

    # access the description field of the snippet
    description = video_response['items'][0]['snippet']['description']
    
    # extract links from description
    links = re.findall(r'(https?://\S+)', description)
    
    # add video title and links (the list may be empty) to videos list
    videos.append({
        'title': search_result['snippet']['title'],
        'links': links
    })

In [None]:
# filter vids into separate lists based on keywords
filtered_vids_keywords = {}

for keyword in keywords:
    filtered_vids_keywords[keyword] = [
        video for video in videos 
        if keyword.lower() in video['title'].lower() and video['links']
    ]
for keyword, vid_list in filtered_vids_keywords.items():
    print(f"VIDEOS WITH '{keyword}':")
    for video in vid_list:
        print("title:", video['title'])
        print("links:")
        for link in video['links']:
            print(link)
        print()

VIDEOS WITH 'haul':
title: SUMMER SHOP WITH ME!!🛒🎀 get out of a fashion rut, collective clothing haul, outfit inspiration
links:
https://www.aritzia.com/us/en/product/the-effortless-pant%E2%84%A2/96000.html?dwvar_96000_color=23914
https://www.aritzia.com/us/en/product/contour-mockneck-tank/83839.html?dwvar_83839_color=30252&dwvar_83839_size=3
https://www.aritzia.com/us/en/product/the-effortless-short%E2%84%A2-lo-rise-3%22/109952.html?dwvar_109952_color=11420
https://www.zara.com/us/en/full-length-trf-high-rise-wide-leg-jeans-p06045025.html?v1=277681186
https://us.princesspolly.com/products/city-of-angels-pant-spanish-grey?currency=USD&variant=39691555340372&utm_medium=cpc&utm_source=google&utm_campaign=Google%20Shopping&utm_source=cpc&utm_medium=google&utm_term=&adid=&matchtype=&addisttype=xpla&tw_source=google&tw_adid=&tw_campaign=19750607918&gclid=CjwKCAjwkeqkBhAnEiwA5U-uM3GDG8GiIhgeP9xrScVE8307JC-uystVIejBh10tUP3vNTA-8ekLDhoCE5cQAvD_BwE
https://www.birkenstock.com/us/boston-suede-le

### Qualitative (pt. 2):
#### Answer/Update to Question/Claim
- How can we use the YouTube API to search through YouTube videos and find fashion-related ones whose links we can use for web scraping?
   - Search for videos using a list of keywords, then access their descriptions to extract links
- Will we be able to get enough data from YouTube?
   - Likely yes, within limits: the default quota is 10,000 units per day and each search request costs 100 units, so we can run about 100 searches per day
- We should be able to search through fashion data by matching titles to a set of fashion-related keywords
   - Yes!
- Then extract links from the video descriptions
   - Yes!
#### Summary & Re-contextualization
- We are able to extract relevant videos by searching for keywords in their title that relate to fashion
- We are able to extract links from their descriptions
#### Story & Domain Knowledge
- Will apply knowledge gained in initial data pull from YouTube to future data pulls!
- Learned about YouTube API - how to gather specific types of data and organize/format it
#### Uncertainty, Limitations & Caveats
- Each search request returns a maximum of 50 results (the cap on `maxResults`)
- Some of the extracted links are not relevant (e.g., social media links)
- Results are not grouped by the most relevant keywords (e.g., clothing and try on)
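
The 50-result cap applies to a single request, and the search response carries a `nextPageToken` that can be passed back as `pageToken` to fetch the next page. Below is a minimal sketch of that loop. The helper name `search_all_pages` and the `max_pages` cap are assumptions made for illustration, and every extra page is another search request against the daily quota.

```python
def search_all_pages(youtube, query, max_pages=3, **search_params):
    """Collect search results across several pages (hypothetical helper).

    `youtube` is a client built with googleapiclient.discovery.build.
    Extra keyword arguments (e.g. publishedAfter) are passed to search().list.
    """
    items = []
    page_token = None
    for _ in range(max_pages):
        params = dict(q=query, part='snippet', type='video',
                      maxResults=50, **search_params)
        if page_token:
            params['pageToken'] = page_token
        response = youtube.search().list(**params).execute()
        items.extend(response.get('items', []))
        # nextPageToken is missing from the response on the last page
        page_token = response.get('nextPageToken')
        if not page_token:
            break
    return items
```

With the client from the setup cell, `search_all_pages(youtube, 'clothing haul', max_pages=2)` would return up to 100 items at the cost of two search requests.
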
#### New Problems & Next Steps
- Is there a better way to organize and extract data?
- How can we extract only relevant links?
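
One possible way to keep only relevant links is to compare each link's host name against a block list instead of searching the whole URL string. The sketch below is an illustration under assumptions: the `SOCIAL_DOMAINS` set and the function name `extract_shop_links` are made up here, and trimming trailing punctuation could in rare cases cut a character that belongs to the URL.

```python
import re
from urllib.parse import urlparse

# assumed block list for illustration; extend as needed
SOCIAL_DOMAINS = {'instagram.com', 'tiktok.com', 'twitter.com', 'facebook.com',
                  'pinterest.com', 'youtube.com', 'youtu.be'}

def extract_shop_links(description, blocked_domains=SOCIAL_DOMAINS):
    """Return links from a description whose host is not a blocked domain."""
    kept = []
    for link in re.findall(r'https?://\S+', description):
        # trim punctuation that often trails a URL in free text
        link = link.rstrip('.,;:!?)')
        host = (urlparse(link).hostname or '').lower()
        if host.startswith('www.'):
            host = host[4:]
        # skip the blocked domain itself and any of its subdomains
        if any(host == d or host.endswith('.' + d) for d in blocked_domains):
            continue
        kept.append(link)
    return kept
```

Matching on the host avoids dropping a store link just because a social media name happens to appear somewhere in its path or query string.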

## **Q2** Is there a better way to organize and extract data? *(Salley)*

### Qualitative:
#### Problem - 
- How can we better organize data?
- How can we get the max amount of data given the quota and results limit?
- How can we get rid of irrelevant links?
#### Hypothesis & Claim - 
- We can perform searches on specific brands initially, rather than keywords
- We can filter on fashion-related keywords after
- We can remove social media links by searching for keywords within the link itself
#### Context, Motivation & Rationale - 
- We can better organize data by brands
- We can extract more data since brands are more specific than keywords
- Our data will look cleaner and more organized if we remove social media links
#### Rationale, Assumptions, Biases - 
- Assuming all YouTube data gathered is accurate and reliable and is filtered by the keywords I selected
- My rationale in selecting my keywords is my own knowledge of fashion, social media, etc.
- I may be biased towards selecting certain brands in my initial search due to personal opinions on them
- I may be biased towards filtering out certain social media links due to my personal knowledge of them
#### Definitions, Data, and Methods - 
- Similar methods as before with more filtering

### Quantitative

In [None]:
social_media_links = ['pinterest', 'youtube', 'twitter', 'instagram', 'tiktok', 'reddit', 'twitch', 'facebook', 'thmatc']

In [None]:
# search for videos
am_eagle_search_results = youtube.search().list(
    q='American Eagle', # search by a specific brand rather than set of fashion-related keywords
    part='snippet',
    type='video',
    publishedAfter=published_after,
    publishedBefore=published_before,
    maxResults=50
).execute()

In [None]:
am_eagle_videos = []
for search_result in am_eagle_search_results.get('items', []):
    video_id = search_result['id']['videoId']
    video_response = youtube.videos().list(
        part="snippet",
        id=video_id
    ).execute()

    description = video_response['items'][0]['snippet']['description']
    
    links = re.findall(r'(https?://\S+)', description)

    # makes all titles lowercase so code can match on any version of title:
    # (e.g., American Eagle, american eagle, AMERICAN EAGLE)
    title = search_result['snippet']['title'].lower()

    # filters based on 'american eagle' in title and fashion-related keywords
    if 'american eagle' in title and any(keyword in title for keyword in keywords):
        # filters out social media links
        filtered_links = [link for link in links if not any(keyword in link for keyword in social_media_links)]

        am_eagle_videos.append({
            'title': search_result['snippet']['title'],
            'links': filtered_links
        })

In [None]:
# only output videos with links in their descriptions that we can scrape
am_eagle_youtube_data = []

for video in am_eagle_videos:
    # check if vid has links
    if video['links']:
        # append the video to the filtered list
        am_eagle_youtube_data.append({
            'Title': video['title'],
            'Links': '\n'.join(video['links'])
        })

# formatted this way so it can be easily converted to csv using to_csv function from pandas
print(am_eagle_youtube_data)

[{'Title': 'American Eagle Denim Haul | Try On | BRUTALLY Honest Review', 'Links': 'https://shopltk.com/explore/Stephanie_Lauer/collections/11ee0ae3055ebff8abd40242ac110003'}, {'Title': 'Shopping While Curvy: Aerie + American Eagle || matching sets &amp; are the jeans curvy friendly?', 'Links': 'https://www.shoplivinfearless.com/\nhttps://kinkistry.com/collections/wefted-hair-closures/products/kinknesis-wefted-bundle?variant=6883127623744\nhttps://www.amazon.com/shop/livin_fearless\nhttps://youtu.be/UzmokCNkUEs\nhttps://youtu.be/gV9shyqatew\nhttps://rstyle.me/+JqMoH6aLHynGpkNoFEsOYQ\nhttps://rstyle.me/+jEAbFk4qNh_glPy2rZ4rdw\nhttps://rstyle.me/+msFu67UM9cc9OPuOQWZHrA\nhttps://rstyle.me/+IO41CFOlJAAB52Mm1N3pDw\nhttps://rstyle.me/+-JDPdOWnc2z3W17e9JTUVA\nhttps://rstyle.me/+WjFzJQzuaDHfgN3RExjR9A\nhttps://rstyle.me/+Rmjw6q4RE3LOAKek0laGYQ\nhttps://rstyle.me/+24Ir1CuXN1iwf3jvejJeTQ\nhttps://rstyle.me/+uxixSRGpy-HyJdfCui_hAg\nhttps://rstyle.me/+oDUkaCUcuIuGWay_ZCuVEw\nhttps://rstyle.me/+TO2

### Qualitative (pt. 2):
#### Answer/Update to Question/Claim
- How can we better organize data?
   - We can search for videos by specific brand
   - We can then filter further by searching for fashion-related keywords in the title
- How can we get the max amount of data given the quota and results limit?
   - Searching by brand gets us more data, since each request returns up to 50 videos per brand rather than 50 fashion-related videos overall
- How can we get rid of irrelevant links?
   - We can filter out links that contain the name of a specific social media platform
#### Summary & Re-contextualization
- Organize by brand and separate them
- Convert data into CSVs, organized by brand
- Filter out some irrelevant links from description
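
A per-brand CSV may be easier to scrape later if it holds one row per link rather than all of a video's links joined in one cell. The sketch below shows that layout. It uses the standard-library `csv` module instead of pandas to stay self-contained, and the function names and the `domain` column are assumptions for illustration.

```python
import csv
import io
from urllib.parse import urlparse

def videos_to_rows(brand, videos):
    """Flatten [{'title': ..., 'links': [...]}] into one row per link."""
    rows = []
    for video in videos:
        for link in video['links']:
            rows.append({
                'brand': brand,
                'title': video['title'],
                'link': link,
                'domain': urlparse(link).hostname or '',
            })
    return rows

def rows_to_csv_text(rows):
    """Serialize rows to CSV text (write the string to a file to save it)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=['brand', 'title', 'link', 'domain'])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

The same rows can also be handed to pandas, for example `pd.DataFrame(rows).to_csv(path, index=False)`.
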
#### Story & Domain Knowledge
- Applied knowledge gained in the initial data pull from YouTube to increase the amount of data that I can gather
- Learned how to organize and format data so it is cleaner
- More knowledge about filtering
#### Uncertainty, Limitations & Caveats
- Takes a long time - limits number of brands we can gather data for
- Do we need even more data to train our tool/model?
#### New Problems & Next Steps
- How can we make the code itself more efficient? Right now, I have to rewrite the same code over and over again for each brand.
- Any further ways we can organize data to make it easier to scrape links? Add more specific columns?
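
One way to avoid rewriting the same cells for each brand is to wrap the search, title filter, and link filter in a single function that takes the brand as a parameter. This is a sketch under assumptions: `collect_brand_videos` is a hypothetical name, `youtube` is the client built in Q1, and the logic mirrors the American Eagle cells above rather than adding anything new.

```python
import re

def collect_brand_videos(youtube, brand, keywords, blocked_terms, **search_params):
    """Search one brand and return [{'title', 'links'}] for matching videos.

    Hypothetical helper: extra keyword arguments (e.g. publishedAfter)
    are passed through to search().list.
    """
    response = youtube.search().list(
        q=brand, part='snippet', type='video', maxResults=50, **search_params
    ).execute()

    results = []
    for item in response.get('items', []):
        title = item['snippet']['title']
        lowered = title.lower()
        # keep only titles that name the brand plus a fashion keyword
        if brand.lower() not in lowered or not any(k in lowered for k in keywords):
            continue
        # fetch the video's snippet to read its description, as in the cells above
        details = youtube.videos().list(
            part='snippet', id=item['id']['videoId']
        ).execute()
        description = details['items'][0]['snippet']['description']
        links = [
            link for link in re.findall(r'https?://\S+', description)
            if not any(term in link.lower() for term in blocked_terms)
        ]
        # skip videos with nothing left to scrape
        if links:
            results.append({'title': title, 'links': links})
    return results
```

A loop such as `for brand in ['American Eagle', 'Aritzia']: data[brand] = collect_brand_videos(youtube, brand, keywords, social_media_links)` would then replace the copied cells. Checking the title before calling `videos().list` also skips the detail request for videos that would be filtered out anyway.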

## **Q3**  How can we scrape the material composition/percentages of online store products? *(Jasmine)*

### Qualitative:
#### Problem -
- How can we use web scraping to get the material composition for clothing on online stores?
#### Hypothesis & Assumptions -
- My initial assumption is that this would be an easy task, as web scraping looks at the HTML and extracts elements
#### Context, Motivation & Rationale -
- We want the Chrome Extension to access the links provided in the description of YouTube hauls on clothing, so that we can find the material and generate a sustainability score for the items featured in the video
#### Definitions, Data, and Methods -
- Using online stores' clothing links
- Using Python language and web scraping libraries (Beautiful Soup and Requests). 
#### Biases & Assumptions
- Brand selection may be biased because the selection criteria were subjective and based on personal preference.
- Selecting brands that are more visible or easily accessible online introduces availability bias.
- Some brands may have been chosen due to confirmation bias, because we believed they would confirm our preconceived notions of sustainability or the lack of it.
- We assumed that these brands are popular and that many people shop from them.

### Quantitative:

Step 1: we want to import the libraries

In [None]:
from bs4 import BeautifulSoup
import requests
import time

Step 2: then we find the link that we want to scrape. For this I will use an American Eagle clothing item:
https://www.ae.com/us/en/p/women/t-shirts/view-all-t-shirts-/ae-hey-baby-ribbed-t-shirt/2370_9630_275?menu=cat4840004


In [None]:
url = "https://www.ae.com/us/en/p/women/t-shirts/view-all-t-shirts-/ae-hey-baby-ribbed-t-shirt/2370_9630_275?menu=cat4840004"


Step 3: use basic web scraping to get the material percentage by searching the HTML for "%"

In [None]:
page = requests.get(url)
soup = BeautifulSoup(page.text, "html.parser")
composition_elements = soup.find_all(string=lambda text: '% ' in str(text).lower())
if composition_elements:
    for element in composition_elements:
        print(element)
else:
    print("none")

none


### Qualitative (pt. 2):
#### Answer/Update to Question/Claim
- How can we use web scraping to get the material composition for clothing on online stores?
    - We can use scraping to obtain the HTML page source, but it does not contain the output I want.
    - My method cannot find "%" on the page. Why is that?
- Domain Knowledge
    - Learned how to do basic web scraping
    - Learned that some sites require more complex web scraping that involves interacting with the page.
#### Summary
- We aren't able to find the materials on American Eagle with basic web scraping functionality.
- I was able to scrape a few websites using this functionality where the materials information was directly on the page and not hidden behind a button.
#### Uncertainty, Limitations & Caveats
- The materials are on the page when I look at it manually.
- Why does this simple method work for some sites/links but not others?
- The content seems to be loaded in by JavaScript.
#### New Problems & Next Steps
- Is there a way to access the materials?
- Find a way to access the materials when they are dynamically loaded.
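
Before moving to a heavier tool, it can help to confirm whether the composition text is present in the raw HTML at all. The sketch below checks an HTML string for a pattern such as "57% Cotton". The function name, the regular expression, and the two sample pages are assumptions for illustration, and the pattern can give false positives on inline CSS values like `100% auto`.

```python
import re

# a number, a percent sign, then a letter: e.g. "57% Cotton"
COMPOSITION_PATTERN = re.compile(r'\d{1,3}\s*%\s*[A-Za-z]')

def has_static_composition(html_text):
    """True if the raw HTML already contains something like '57% Cotton'."""
    return bool(COMPOSITION_PATTERN.search(html_text))

# made-up pages: one with the text in the HTML, one filled in later by a script
static_page = '<div class="details"><p>57% Cotton, 38% Recycled Polyester</p></div>'
dynamic_page = '<div id="root"></div><script src="/bundle.js"></script>'

print(has_static_composition(static_page))   # True
print(has_static_composition(dynamic_page))  # False
```

Running `has_static_composition(requests.get(url).text)` on a product link gives a quick signal of whether basic scraping is enough or the content is loaded after the page is served.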

## **Q4**  How can we scrape material composition/percentages that are dynamically loaded by JavaScript? *(Jasmine)*

### Qualitative:
#### Problem -
- Is there a way to access the materials that are loaded in by JavaScript?
- Web scraping is not as easy as I originally thought.
#### Hypothesis -
- If the HTML page source does not contain JavaScript-loaded content, there must be another way to access it.
#### Context, Motivation & Rationale -
- We want to obtain material information for many brands, not just the few that the simple code above can handle.
#### Definitions, Data, and Methods -
- Using Selenium WebDriver, a library and tool used for automated web scraping that allows you to control a web browser, interact with dynamic elements, and scrape the resulting content. 

### Quantitative:

Step 1: import relevant libraries and tools

In [None]:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from lxml import html

Step 2: examine the website to find which elements need to be interacted with, then get the XPath of the interactive element and of the loaded-in content.

In [None]:
url = "https://www.ae.com/us/en/p/women/t-shirts/view-all-t-shirts-/ae-hey-baby-ribbed-t-shirt/2370_9630_275?menu=cat4840004"
interactive_element_xpath = '//*[@id="main-content-focus"]/div[2]/div[2]/div[2]/div/div[3]/div/div[1]/div[1]'
loaded_content_xpath = '//*[@id="main-content-focus"]/div[2]/div[2]/div[2]/div/div[3]/div/div[1]/div[2]/div/div[2]'

Step 3: use Selenium WebDriver to interact with the page and scrape the loaded-in material information

In [None]:
def scrape_american_eagle(url, interactive_element_xpath, loaded_content_xpath, wait_time=10):
    # Start as None so the finally block is safe even if Chrome fails to launch
    driver = None
    try:
        # Set Chrome options (defaults)
        chrome_options = Options()

        # Initialize the WebDriver
        driver = webdriver.Chrome(options=chrome_options)

        # Open the webpage
        driver.get(url)

        # Wait (up to wait_time seconds) until the interactive element can be clicked
        interactive_element = WebDriverWait(driver, wait_time).until(
            EC.element_to_be_clickable((By.XPATH, interactive_element_xpath))
        )

        # Click the interactive element
        interactive_element.click()

        # Wait for the loaded content to be visible
        loaded_element = WebDriverWait(driver, wait_time).until(
            EC.visibility_of_element_located((By.XPATH, loaded_content_xpath))
        )

        # Once loaded, scrape the content
        dynamic_content = loaded_element.text

        return dynamic_content

    except Exception as e:
        print(f"An error occurred: {str(e)}")
        return None

    finally:
        # Close the WebDriver if it was started
        if driver is not None:
            driver.quit()


dynamic_content = scrape_american_eagle(url, interactive_element_xpath, loaded_content_xpath)
if dynamic_content:
    print(dynamic_content)

Materials & Care
57% Cotton, 38% Recycled Polyester, 5% Elastane
Machine wash
Imported


### Qualitative (pt. 2):
#### Answer/Update to Question/Claim
- Is there a way to access the materials that are loaded in by JavaScript?
    - Yes!
- Web scraping is not as easy as I originally thought.
    - Now I have learned about WebDriver and automation!
    - I also learned that there are anti-bot systems in place, such as CAPTCHA and Cloudflare.
#### Summary
- Using Selenium and WebDriver, I can now scrape dynamically loaded content that I wasn't able to scrape before.
- I still need to use regular expressions to extract the numbers and materials.
- I was able to scrape 30 brands using the methods I learned.
#### Uncertainty, Limitations & Caveats
- Not all websites are the same, so I can't reuse this method for every site.
- Selenium opens an automated test browser, which triggers bot detection on some sites and prevents me from scraping what I need.
- The scraper returns the relevant block of text, but how can I extract the individual values from it?
#### New Problems & Next Steps
- Scrape more websites!
- Some sites cannot be scraped because of bot detection. For ethical reasons I won't try to bypass it, as doing so can be against a site's terms of use.
- Use regular expressions to pull the information into variables for the sustainability score.
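
As a starting point for that last step, the sketch below pulls (percentage, material) pairs out of the text block scraped above. The function name `parse_composition` and the exact pattern are assumptions. It expects the "57% Cotton" ordering seen in the American Eagle output and would need adjusting for sites that format composition differently.

```python
import re

def parse_composition(text):
    """Return [(percent, material), ...] from text such as '57% Cotton, 5% Elastane'."""
    # number, percent sign, then a material name made of letters, spaces, or hyphens
    pattern = re.compile(r'(\d{1,3}(?:\.\d+)?)\s*%\s*([A-Za-z][A-Za-z \-]*)')
    parts = []
    for percent, material in pattern.findall(text):
        parts.append((float(percent), material.strip()))
    return parts

# the text scraped from the American Eagle page above
scraped = """Materials & Care
57% Cotton, 38% Recycled Polyester, 5% Elastane
Machine wash
Imported"""

composition = parse_composition(scraped)
print(composition)
# [(57.0, 'Cotton'), (38.0, 'Recycled Polyester'), (5.0, 'Elastane')]
```

Checking that the percentages sum to 100 is a cheap sanity test before a parsed result is passed on to the sustainability score.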

## **Q5** How are popular brands rated by Good On You? *(Megan)*

### Qualitative:
#### Problem - 
- How can we learn more about the sustainability of popular brands?
#### Hypothesis & Claim - 
- We will extract data from Good On You's website to create a dataset of sustainability ratings for brands.
- We will use this data to compare sustainability practices of different brands and understand possible factors that go into a sustainability rating.
#### Context, Motivation & Rationale - 
- A *brand's* sustainability practices play a major role in determining the sustainability of a specific *product*, so it's important that we have that information.
- We can also examine Good On You's brand descriptions to identify the criteria they use to evaluate sustainability ratings.
#### Definitions, Data, and Methods - 
- For each of the brands, use Beautiful Soup to go to its webpage and get the overall rating, subratings (Planet, People, Animals), and description/reasoning.
#### Assumptions - 
- Good On You has done extensive research to provide accurate and comprehensive sustainability ratings for brands.
- Good On You considers a variety of factors/criteria related to sustainability practices.
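
Reaching each brand's page requires turning a brand name into the URL form used after `/brand/`. The sketch below gathers the conversion rules into one function. The name `brand_to_slug` is hypothetical, and collapsing repeated dashes (so "Alice + Olivia" becomes `alice-olivia`) is an assumption about the directory's URLs that has not been verified for every brand.

```python
import re
import unicodedata

def brand_to_slug(brand):
    """Convert a brand name to a lowercase, dash-separated URL slug."""
    # strip accents: 'é' becomes 'e'
    text = unicodedata.normalize('NFKD', brand)
    text = ''.join(ch for ch in text if not unicodedata.combining(ch))
    # lowercase and spell out '&'
    text = text.lower().replace('&', 'and')
    # drop apostrophes, periods, and slashes
    text = re.sub(r"['./]", '', text)
    # turn runs of spaces and '+' into a single dash
    text = re.sub(r'[\s+]+', '-', text)
    return text.strip('-')

print(brand_to_slug('Abercrombie & Fitch'))  # abercrombie-and-fitch
print(brand_to_slug('Réalisation Par'))      # realisation-par
```

Keeping the rules in one place makes it easier to adjust them when a brand turns out to use a different slug.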

### Quantitative:

In [None]:
# step 1: create a list of brands

brands = [
    'Princess Polly',
    'Brandy Melville',
    'Shein',
    'Nike',
    'Abercrombie & Fitch',
    'Amazon',
    'ASOS',
    'Forever 21', 
    'American Eagle',
    'Alo',
    'Reformation',
    'Acne Studios',
    'Alice + Olivia',
    'Sandy Liang',
    'Billabong',
    'Adidas',
    'Aritzia',
    'Uniqlo',
    'Area',
    'Balenciaga',
    'Bottega Veneta',
    'Brooks Brothers',
    'Burberry',
    'Chanel',
    'Coach',
    'Fendi',
    'Gucci',
    'Hermes',
    'Louis Vuitton',
    'Prada',
    'Ralph Lauren',
    'Saint Laurent',
    'Stella McCartney',
    'Telfar',
    'The Row',
    'Theory',
    'Tom Ford',
    'Tory Burch',
    'Valentino',
    '7 For All Mankind',
    "Arc'teryx",
    'aventura',
    'Banana Republic',
    'Boden',
    'Buck Mason',
    'Calvin Klein',
    'Carhartt',
    'Christy Dawn',
    'Columbia',
    'Cotopaxi',
    'Dickies',
    'Djerf Avenue',
    'Doen',
    'Edikted',
    'Everlane',
    'Faithfull the Brand',
    'Frankies Bikinis',
    'Girlfriend Collective',
    'Good American',
    'House of Sunny',
    'J.Crew',
    "Levi's",
    'Madewell',
    'Organic Basics',
    'Pact',
    'Patagonia',
    'prAna',
    'Quince',
    'RE/DONE',
    'Réalisation Par',
    'REI',
    'Sezane',
    'Spanx',
    'Summersalt',
    'tentree',
    'The North Face',
    'Tommy Hilfiger',
    'True Religion',
    'Wrangler',
    'Yes Friends',
    'Aeropostale',
    'Boohoo',
    'Cider',
    'Fashion Nova',
    'GUESS',
    'Hollister',
    'Hot Topic',
    'House of CB',
    'Mango',
    'Missguided',
    'Nasty Gal',
    'PacSun',
    'PrettyLittleThing',
    'Primark',
    'Romwe',
    'Temu',
    'Topshop',
    'Torrid',
    'Under Armour',
    "Victoria's Secret",
    'Yesstyle'
]

# brands added manually (only overall rating is provided): 
# - Ann Taylor
# - Aerie
# - Garage
# - Pink

In [None]:
# step 2: get data (ratings and description) from each brand

import requests
from bs4 import BeautifulSoup
import pandas as pd

all_brand_info = []
cols = ['brand', 'overall_rating', 'planet_score', 'people_score', 'animals_score', 'description']

url = 'https://directory.goodonyou.eco/brand/'

for brand in brands:
    try:
        # convert brand name to url format
        brand_converted = brand.lower() # lowercase
        brand_converted = brand_converted.replace(' ', '-') # replace spaces with dashes
        brand_converted = brand_converted.replace('&', 'and') # replace '&' with 'and'
        brand_converted = brand_converted.replace('+', '-') # replace '+' with '-'
        brand_converted = brand_converted.replace("'", '') # remove apostrophes
        brand_converted = brand_converted.replace('.', '') # remove periods
        brand_converted = brand_converted.replace('/', '') # remove slashes
        brand_converted = brand_converted.replace('é', 'e') # remove accent

        # get data on each brand's page
        response = requests.get(url+brand_converted)
        soup = BeautifulSoup(response.text, 'html.parser')

        # note: when finding classes on GOU's website, you might need to disable JavaScript (since BeautifulSoup can't load dynamic content)

        # get overall rating 
        overall_rating = soup.find('h6', class_='StyledHeading-sc-1rdh4aw-0 jNSEQB id__OverallRating-sc-12z6g46-7 cjSjNJ')
        overall_rating = overall_rating.text.split(': ')[1]

        # get subratings for planet, people, and animals
        subratings = soup.find_all('div', class_='id__RatingSingle-sc-12z6g46-9 ksJKxw')
    
        # remove category name from text 
        subratings[0] = subratings[0].text.split('Planet')[1]
        subratings[1] = subratings[1].text.split('People')[1]
        subratings[2] = subratings[2].text.split('Animals')[1]

        # if there's a rating, convert to int (makes it easier to analyze later)
        for i, rating in enumerate(subratings):
            if rating != 'Not applicable':
                subratings[i] = int(rating.split(' ')[0])

        # get description/justification
        text = soup.find('div', class_='id__BodyText-sc-12z6g46-15 eUqrmK').text
        
        # create new list of current brand info and add data
        brand_info = []
        brand_info.append(brand)
        brand_info.append(overall_rating)
        brand_info.append(subratings[0])
        brand_info.append(subratings[1])
        brand_info.append(subratings[2])
        brand_info.append(text)
        
        # add to overall list of brand info
        all_brand_info.append(brand_info)
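    except Exception as e:
        # any failure lands here: a missing brand page, changed class names, or a network error
        print(f"{brand} could not be scraped from Good On You's Directory ({type(e).__name__})")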

In [None]:
# step 3: add certain brands (with limited data provided) manually 

# these brands only have an overall rating, so the three subratings are
# 'Not applicable' and the description is left empty
manual_brands = ['Aerie', 'Ann Taylor', 'Garage', 'Pink']

for manual_brand in manual_brands:
    # order matches cols: brand, overall_rating, planet_score, people_score, animals_score, description
    all_brand_info.append(
        [manual_brand, 2, 'Not applicable', 'Not applicable', 'Not applicable', '']
    )

In [None]:
# step 4: create dataframe

brand_df = pd.DataFrame(all_brand_info, columns=cols)

# replace Good On You's categories with numerical ratings
# source: https://saturncloud.io/blog/how-to-convert-categorical-data-to-numerical-data-with-pandas
brand_df['overall_rating'] = brand_df['overall_rating'].replace({
    'We avoid': 1,
    'Not good enough': 2,
    "It's a start": 3,
    'Good': 4,
    'Great': 5
})

In [None]:
brand_df

| | brand | overall_rating | planet_score | people_score | animals_score | description |
|---|---|---|---|---|---|---|
| 0 | Princess Polly | 2 | 2 | 2 | 4 | Our “Planet” rating evaluates brands based on ... |
| 1 | Brandy Melville | 1 | 1 | 1 | 0 | This brand provides insufficient relevant info... |
| 2 | Shein | 1 | 1 | 1 | 2 | Our “Planet” rating evaluates brands based on ... |
| 3 | Nike | 3 | 3 | 3 | 2 | Our “Planet” rating evaluates brands based on ... |
| 4 | Abercrombie & Fitch | 2 | 2 | 2 | 2 | Abercrombie & Fitch is owned by Abercrombie Ab... |
| ... | ... | ... | ... | ... | ... | ... |
| 100 | Yesstyle | 1 | 1 | 1 | 0 | This brand provides insufficient relevant info... |
| 101 | Aerie | 2 | Not applicable | Not applicable | | |
| 102 | Ann Taylor | 2 | Not applicable | Not applicable | | |
| 103 | Garage | 2 | Not applicable | Not applicable | | |


In [None]:
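## step 5: export to a CSV file
# this CSV file is saved in the 'data' folder
# index=False keeps the row index out of the file (otherwise it reappears as 'Unnamed: 0' on reload)
brand_df.to_csv('../data/brand_info.csv', index=False)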

### Qualitative (pt. 2):
#### Answer/Update to Question/Claim
- How can we learn more about the sustainability of popular brands?
   - We can use Good On You, a source for fashion sustainability ratings, to extract ratings and rationales for popular clothing brands.
#### Summary & Re-contextualization
- We were able to get sustainability ratings for 105 brands (luxury, sustainable, fast fashion, etc.).
#### Uncertainty, Limitations & Caveats
- Some brands have not been rated by Good On You.
- So far, we are only relying on one source for ratings.
#### New Problems & Next Steps
- We plan to cross-reference these ratings with other fashion sustainability sites, such as Eco-Stylist and Sustainable Review. This will ensure a more comprehensive and balanced assessment of the brands' sustainability practices and might also give us a broader selection of brands to analyze.

## **Q6** How can we compile a list of sustainable brands recommended by Good On You? *(Megan)*

### Qualitative:
#### Problem - 
- How can we find all of the sustainable brands that have been recommended by Good On You?
#### Hypothesis & Claim - 
- We should be able to extract data from Good On You's website to create a dataset of sustainability ratings for brands.
- We will use this data to compare sustainability practices of different brands and understand possible factors that go into a sustainability rating.
#### Context, Motivation & Rationale - 
- We want our Chrome extension to provide more sustainable alternatives to users, so we need a set of possible sustainable brands to recommend.
- We also aim to analyze the reasoning behind these ratings to understand the factors that Good On You took into consideration.
#### Definitions, Data, and Methods - 
- For each of the categories listed in the directory, use Selenium to parse through their 100 recommended brands and extract the overall rating, subratings (Planet, People, Animals), and description/reasoning.
#### Assumptions - 
- Good On You's description of each sustainable brand is informative enough that we are able to understand *why* they are considered sustainable.
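
Since both the popular-brand and recommended-brand datasets need the same rating clean-up, the conversion could live in two small helpers. This is a sketch under assumptions: the function names are hypothetical, and the "3 out of 5" shape of a subrating string is inferred from how the Q5 code takes the first space-separated token.

```python
# Good On You's overall labels mapped to a 1-5 scale (same mapping as in Q5)
RATING_SCALE = {
    'We avoid': 1,
    'Not good enough': 2,
    "It's a start": 3,
    'Good': 4,
    'Great': 5,
}

def parse_subrating(text):
    """Turn e.g. '3 out of 5' into 3; return None when no score is given."""
    first = text.strip().split(' ')[0]
    return int(first) if first.isdigit() else None

def parse_overall(label):
    """Map an overall label to its 1-5 number; None for unknown labels."""
    return RATING_SCALE.get(label.strip())
```

Returning `None` instead of the string 'Not applicable' keeps the score columns numeric, which may make later comparisons between brands simpler.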

### Quantitative

In [None]:
## step 1: get brands recommended by Good On You

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

categories = [
    'tops',
    'dresses',
    'basics',
    'bottoms',
    'denim',
    'outerwear',
    'knitwear',
    'activewear',
    'sleepwear'
]

brands = set()

# set Chrome options (defaults; adding the '--headless' argument would hide the browser window)
chrome_options = Options()

# initialize WebDriver
driver = webdriver.Chrome(options=chrome_options)

for category in categories:
    # open webpage
    driver.get('https://directory.goodonyou.eco/categories/' + category)

    # get initial height of the page
    last_height = driver.execute_script("return document.body.scrollHeight")

    while True:
        # scroll
        # source: https://stackoverflow.com/questions/73792388/how-to-scroll-to-the-bottom-of-the-page-with-selenium-python
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        
        # wait for 2 seconds
        time.sleep(2)
        
        # get new height of page
        new_height = driver.execute_script("return document.body.scrollHeight")
        
        # stop once it gets to the bottom
        if new_height == last_height:
            break
        
        # update height
        last_height = new_height

    # 100 results are now shown on the page
    for i in range(1, 101):
        # get xpath for each brand
        brand_xpath = f'//*[@id="__next"]/div/div[4]/div/div[2]/div/div/div[{i}]/div/div/div[2]/h5/a'

        # get brand element (waits up to 5 seconds for it to become visible)
        brand_element = WebDriverWait(driver, 5).until(
            EC.visibility_of_element_located((By.XPATH, brand_xpath))
        )

        # get brand name and link
        # source for link: https://stackoverflow.com/questions/54862426/python-selenium-get-href-value
        brand_name = brand_element.text
        brand_link = brand_element.get_attribute('href')

        brands.add((brand_name, brand_link))
     
# close the WebDriver
driver.quit()

In [None]:
## step 2: get data (ratings and description) from each brand

# create list of all recommended brands + their info
recommended = []
cols = ['brand', 'overall_rating', 'planet_score', 'people_score', 'animals_score', 'description']

for brand in brands:
    brand_name = brand[0]
    brand_href = brand[1]

    # get data on each brand's page
    response = requests.get(brand_href)
    soup = BeautifulSoup(response.text, 'html.parser')

    # get overall rating 
    overall_rating = soup.find('h6', class_='StyledHeading-sc-1rdh4aw-0 jNSEQB id__OverallRating-sc-12z6g46-7 cjSjNJ')
    overall_rating = overall_rating.text.split(': ')[1]
    
    # get subratings for planet, people, and animals
    subratings = soup.find_all('div', class_='id__RatingSingle-sc-12z6g46-9 ksJKxw')
    
    # remove category name from text 
    subratings[0] = subratings[0].text.split('Planet')[1]
    subratings[1] = subratings[1].text.split('People')[1]
    subratings[2] = subratings[2].text.split('Animals')[1]

    # if there's a rating, convert to int (makes it easier to analyze later)
    for i, rating in enumerate(subratings):
        if rating != 'Not applicable':
            subratings[i] = int(rating.split(' ')[0])

    # get description/justification
    text = soup.find('div', class_='id__BodyText-sc-12z6g46-15 eUqrmK').text

    # create new list of current brand info and add data
    brand_info = []
    brand_info.append(brand_name)
    brand_info.append(overall_rating)
    brand_info.append(subratings[0])
    brand_info.append(subratings[1])
    brand_info.append(subratings[2])
    brand_info.append(text)
    
    # add to overall list of brand info
    recommended.append(brand_info)

In [None]:
## step 3: create dataframe

recommended_df = pd.DataFrame(recommended, columns=cols)

# replace Good On You's categories with numerical ratings
# source: https://saturncloud.io/blog/how-to-convert-categorical-data-to-numerical-data-with-pandas
recommended_df['overall_rating'] = recommended_df['overall_rating'].replace({
    'We avoid': 1,
    'Not good enough': 2,
    "It's a start": 3,
    'Good': 4,
    'Great': 5
})

In [None]:
recommended_df

| | brand | overall_rating | planet_score | people_score | animals_score | description |
|---|---|---|---|---|---|---|
| 0 | Sami Miro Vintage | 4 | 5 | 3 | 4 | Our “Planet” rating evaluates brands based on ... |
| 1 | Ognx | 3 | 3 | 3 | Not applicable | Our “Planet” rating evaluates brands based on ... |
| 2 | Mantis World | 5 | 4 | 4 | 5 | Mantis World's environment rating is 'good'. I... |
| 3 | Le Gramme | 3 | 4 | 2 | Not applicable | Our “Planet” rating evaluates brands based on ... |
| 4 | WAXON | 4 | 3 | 3 | 4 | WAXON's environment rating is 'it's a start'. ... |
| ... | ... | ... | ... | ... | ... | ... |
| 451 | Luva Huva | 4 | 4 | 3 | Not applicable | Our “Planet” rating evaluates brands based on ... |
| 452 | Pareto | 4 | 5 | 3 | 5 | Our “Planet” rating evaluates brands based on ... |
| 453 | Viktoria and Woods | 3 | 3 | 3 | 3 | Viktoria and Woods's environment rating is 'it... |
| 454 | London W11 | 4 | 5 | 3 | 4 | London W11's environment rating is 'great'. It... |


In [None]:
## step 4: export to a CSV file

# this CSV file is saved in the 'data' folder
recommended_df.to_csv('../data/gou_recommended.csv')

### Qualitative:
#### Answer/Update to Question/Claim
- How can we find all of the sustainable brands that have been recommended by Good On You?
   - We can use Selenium to find 100 sustainable brands for each clothing category.
#### Summary & Re-contextualization
- We were able to get sustainability ratings for 456 brands from 9 clothing categories.
#### Uncertainty, Limitations & Caveats
- While Good On You provides valuable sustainability ratings for a wide range of brands, it is important to note that their database may not encompass all sustainable brands in the market. Some brands, especially smaller or newer ones that may not have been evaluated by Good On You yet, could be missing from their ratings.
#### Next Steps
- Our Chrome extension will have a feature to recommend more sustainable alternatives from these brands.
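
A minimal sketch of how this recommendation step might pick candidates from the scraped ratings is shown here. The helper name `recommend_alternatives`, the default threshold of 4 ('Good'), and the small sample DataFrame are assumptions for illustration; in practice the input would be `recommended_df` or the exported CSV.

```python
import pandas as pd

def recommend_alternatives(df, min_rating=4, limit=5):
    """Return up to `limit` brand names whose overall_rating is at least `min_rating`.

    Hypothetical helper: assumes `overall_rating` uses the 1-5 scale from step 3.
    """
    # coerce to numbers so any unmapped labels become NaN and are filtered out
    ratings = pd.to_numeric(df['overall_rating'], errors='coerce')
    candidates = df[ratings >= min_rating]
    # highest-rated brands first; ties broken alphabetically for a stable order
    candidates = candidates.sort_values(['overall_rating', 'brand'], ascending=[False, True])
    return candidates['brand'].head(limit).tolist()

# sample rows copied from the first rows of the dataset (illustration only)
sample_df = pd.DataFrame({
    'brand': ['Sami Miro Vintage', 'Ognx', 'Mantis World', 'Le Gramme', 'WAXON'],
    'overall_rating': [4, 3, 5, 3, 4],
})
top_brands = recommend_alternatives(sample_df)
print(top_brands)
```

A real feature would probably also filter by clothing category, which this dataset does not currently store per brand.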

## **Q6** How are sustainable brands rated on Eco-Stylist? *(Sabrina)*

### Qualitative:

#### Problem - 
- Our team can collect information about brands directly from their websites, but the information required to make an accurate assessment of brand sustainability is often unavailable
- What criteria do established organizations like Eco-Stylist use to rate sustainability?

#### Hypothesis & Claim - 
- We will extract data from Eco-Stylist's website to create a dataset of sustainability ratings for brands
- Examining the criteria used by Eco-Stylist for evaluating brands will help us develop our own formula for rating sustainability

#### Context, Motivation & Rationale - 
- Established organizations focused on sustainability in fashion like Eco-Stylist have already dedicated time and resources to collect and analyze the relevant data for rating brands
- Eco-Stylist also offers transparency about what criteria they consider for their overall sustainability score
- We are motivated to extract data from Eco-Stylist to have sustainability ratings to compare our own against
- We use the Python library BeautifulSoup for web scraping because it is popular and well documented

#### Definitions, Data, and Methods -
- Eco-Stylist is a guide to ethical and eco-friendly fashion
- Data will be collected from the Eco-Stylist website
- Webscraping with the Python library BeautifulSoup to visit the webpage for each brand and get the overall rating and subratings (Transparency, Fair Labor, Sustainably Made)
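
The text handling behind this method can be tried without any network access. The sketch below is illustrative only: `parse_rating` is a hypothetical helper, and the sample strings assume that a review page presents its ratings as `Overall Rating: <label>` and `Rated: <label>`.

```python
import re

def parse_rating(text, prefix):
    """Return the first word that follows `prefix` in `text`, or None if absent."""
    match = re.search(re.escape(prefix) + r"\s*(\w+)", text)
    return match.group(1) if match else None

# assumed examples of how the rating text appears on a review page
overall = parse_rating("Overall Rating: Silver", "Overall Rating:")
transparency = parse_rating("Rated: Good", "Rated:")
missing = parse_rating("No rating here", "Rated:")
print(overall, transparency, missing)
```

Returning `None` for a missing rating, rather than raising an error, makes it possible to keep a brand in the dataset even when one of its subratings cannot be found.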

#### Biases & Assumptions -
- The research which informs the subratings is reliable and fact checked
- All brands which Eco-Stylist has collected sustainability data on are publicly available

### Quantitative:

In [None]:
# imports
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from bs4 import BeautifulSoup
import pandas as pd
import re
import csv

In [None]:
# create session
session = requests.Session()
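# retry up to three times on connection errors, with exponential backoff between attempts
retry = Retry(connect=3, backoff_factor=0.5)
# mount an adapter so every https:// request in this session uses the retry policy
adapter = HTTPAdapter(max_retries=retry)
session.mount('https://', adapter)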

In [None]:
# scrape urls to brand reviews
eco_stylist_brands = session.get("https://www.eco-stylist.com/sustainable-brands/")
content = BeautifulSoup(eco_stylist_brands.text, 'html.parser')
# all links that have an href attribute
page_links = content.find_all("a", href=True)
# links to brand reviews (excluding the directory page itself)
directory_url = "https://www.eco-stylist.com/ethical-brand/"
brand_urls = set(link['href'] for link in page_links if directory_url in link['href'] and link['href'] != directory_url)

print(brand_urls)

{'https://www.eco-stylist.com/ethical-brand/taylor-stitch/', 'https://www.eco-stylist.com/ethical-brand/reprise/', 'https://www.eco-stylist.com/ethical-brand/unspun/', 'https://www.eco-stylist.com/ethical-brand/groceries-apparel/', 'https://www.eco-stylist.com/ethical-brand/po-zu/', 'https://www.eco-stylist.com/ethical-brand/kotn/', 'https://www.eco-stylist.com/ethical-brand/toadco/', 'https://www.eco-stylist.com/ethical-brand/hernest-project/', 'https://www.eco-stylist.com/ethical-brand/naadam/', 'https://www.eco-stylist.com/ethical-brand/edwin/', 'https://www.eco-stylist.com/ethical-brand/coalatree/', 'https://www.eco-stylist.com/ethical-brand/known-supply/', 'https://www.eco-stylist.com/ethical-brand/ten-thousand-villages/', 'https://www.eco-stylist.com/ethical-brand/isto/', 'https://www.eco-stylist.com/ethical-brand/kindom/', 'https://www.eco-stylist.com/ethical-brand/no-nasties/', 'https://www.eco-stylist.com/ethical-brand/beckett-simonon/', 'https://www.eco-stylist.com/ethical-br

In [None]:
# create dataframe 
eco_stylist_df = pd.DataFrame(columns=['Brand', 'Overall', 'Transparency', 'Fair Labor', 'Sustainably Made', 'URL'])

In [None]:
# scrape for brand data
for url in brand_urls:
    try:
        # request the brand review (inside try so a failed request is also reported)
        review = session.get(url)

        # content
        content = BeautifulSoup(review.text, 'html.parser')

        # brand name
        brand = content.find('h1').get_text()

        # overall rating
        overall = content.find(string=re.compile("Overall Rating:")).split(" ")[2]
        
        # transparency, fair labor, sustainably made
        ratings = content.find_all(string=re.compile("Rated:"))
        transparency = ratings[0].split(" ")[1]
        fair_labor = ratings[1].split(" ")[1]
        sustainably_made = ratings[2].split(" ")[1]

        # update dataframe
        eco_stylist_df.loc[len(eco_stylist_df.index)] = [brand, overall, transparency, fair_labor, sustainably_made, url]

    # catch scraping errors only, so Ctrl+C can still stop the loop
    except Exception:
        print(url + " Failed to Load")

In [None]:
# visualize first 10 rows of the new dataset
eco_stylist_df.head(10)

| | Brand | Overall | Transparency | Fair Labor | Sustainably Made | URL |
|---|---|---|---|---|---|---|
| 0 | Taylor Stitch | Silver | Good | Good | Excellent | https://www.eco-stylist.com/ethical-brand/tayl... |
| 1 | Reprise | Certified | Good | Good | Excellent | https://www.eco-stylist.com/ethical-brand/repr... |
| 2 | Unspun | Certified | Good | Fair | Excellent | https://www.eco-stylist.com/ethical-brand/unspun/ |
| 3 | Groceries Apparel | Certified | Good | Good | Good | https://www.eco-stylist.com/ethical-brand/groc... |
| 4 | Po-Zu | Certified | Good | Good | Excellent | https://www.eco-stylist.com/ethical-brand/po-zu/ |
| 5 | Kotn | Silver | Excellent | Good | Good | https://www.eco-stylist.com/ethical-brand/kotn/ |
| 6 | Toad&Co | Silver | Fair | Good | Excellent | https://www.eco-stylist.com/ethical-brand/toadco/ |
| 7 | Hernest Project | Certified | Excellent | Good | Good | https://www.eco-stylist.com/ethical-brand/hern... |
| 8 | NAADAM | Certified | Good | Good | Excellent | https://www.eco-stylist.com/ethical-brand/naadam/ |
| 9 | EDWIN | Certified | Excellent | Fair | Excellent | https://www.eco-stylist.com/ethical-brand/edwin/ |


In [None]:
# export csv file
eco_stylist_df.to_csv('../data/eco_stylist_ratings.csv')

### Qualitative (pt. 2):

#### Answer/Update to Question/Claim
- Eco-Stylist assesses brand sustainability on three criteria (Transparency, Fair Labor, and Sustainably Made) and then assigns an overall rating.
- We successfully extracted data from Eco-Stylist's website to create a dataset of sustainability ratings for brands

#### Summary & Re-contextualization
- Breaking down the overall sustainability rating into distinct subratings helps provide context for why a brand received a certain score and allows the overall ratings to be easily compared
- We may create a similar subrating to "Sustainably Made" for calculating our own sustainability ratings by evaluating what materials a product is made out of

#### Uncertainty, Limitations & Caveats
- Eco-Stylist has rated only a small number of brands
- Eco-Stylist only evaluates sustainable brands (omitting most popular fast fashion brands)

#### New Problems & Next Steps
- We will scrape data from more organizations which review fashion brands to increase the number of ratings we can compare ours against
- How can we compare sustainability ratings across different organizations?
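
One possible starting point for this comparison is to place each organization's labels on a shared ordinal scale. The sketch below is only an illustration: the ordering of Eco-Stylist's tiers (`Certified` below `Silver` below `Gold`), the inclusion of a `Gold` tier, and the even spacing between levels are all assumptions, not mappings published by either organization.

```python
# assumed ordering of labels, lowest to highest (not an official mapping)
ECO_STYLIST_OVERALL = ['Certified', 'Silver', 'Gold']
GOOD_ON_YOU_OVERALL = ['We avoid', 'Not good enough', "It's a start", 'Good', 'Great']

def to_unit_scale(label, ordered_levels):
    """Map an ordinal label to an evenly spaced value in [0, 1]; unknown labels give None."""
    if label not in ordered_levels:
        return None
    return ordered_levels.index(label) / (len(ordered_levels) - 1)

eco_scores = [to_unit_scale(label, ECO_STYLIST_OVERALL) for label in ['Silver', 'Certified', 'Gold']]
gou_scores = [to_unit_scale(label, GOOD_ON_YOU_OVERALL) for label in ['Good', 'Great', 'We avoid']]
print(eco_scores)
print(gou_scores)
```

Because Eco-Stylist only evaluates sustainable brands, its lowest tier is not equivalent to Good On You's lowest rating, so a real comparison would need to adjust for that difference rather than rely on even spacing alone.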

#### Story & Domain Knowledge
- Scraping data from the Eco-Stylist website was relatively simple because the links to each brand review had a uniform format 
- The criteria used by Eco-Stylist to determine the overall sustainability score resemble the criteria used by Good On You

## **Q7** How are sustainable brands rated on Sustainable Review? *(Sabrina)*

### Qualitative:

#### Problem - 
- Eco-Stylist only reviewed sustainable brands, and the information required to make an accurate assessment of brand sustainability is often unavailable on brand websites
- What criteria do established organizations which have reviewed a greater number of brands like Sustainable Review use to rate sustainability?

#### Hypothesis & Claim - 
- We will extract data from Sustainable Review's website to create a dataset of sustainability ratings for brands
- Summarizing the research and factors which determine the overall sustainability rating of a brand could provide insight into how we should rate brands 

#### Context, Motivation & Rationale - 
- Sustainable Review has a more comprehensive database of rated brands than Eco-Stylist
- Sustainable Review documents all of the research which ultimately determines the overall sustainability score
- We are motivated to investigate whether relevant factors for rating sustainability can be extracted from the description of each brand on Sustainable Review
- We use the Python library BeautifulSoup for web scraping because it is popular and well documented

#### Definitions, Data, and Methods -
- Sustainable Review publishes sustainability content weekly
- Data will be collected from the Sustainable Review website
- Webscraping with the Python library BeautifulSoup to visit the webpage for each brand and get the overall rating and factors (headlines from within the description)

#### Biases & Assumptions -
- The research which informs the overall ratings is up to date
- All brands which Sustainable Review has collected sustainability data on are publicly available

### Quantitative:

In [None]:
# get page range
first_page_url = "https://sustainablereview.com/brand-ratings/"
first_page = session.get(first_page_url)
content = BeautifulSoup(first_page.text, 'html.parser')
# last page number out of all page numbers
last_page_num = str(content.find_all('a', class_="page-numbers")[-1]).split('>')[1].split('<')[0]

print(last_page_num)

38


In [None]:
# get brand links from all pages
brand_urls = []
# loop through all pages
for i in range(1, int(last_page_num) + 1):
    # scrape first page
    if i == 1:
        page = first_page
    # scrape remaining pages
    else:
        next_page_url = "https://sustainablereview.com/brand-ratings/?query-48-page="
        page = session.get(next_page_url + str(i))

    # content
    content = BeautifulSoup(page.text, 'html.parser')

    # links to brand reviews (skip anchors without an href, and the index page itself)
    page_links = content.find_all("a", href=True)
    brands = set(link['href'] for link in page_links if ("https://sustainablereview.com/brand-ratings/" in link['href']) and (link['href'] != "https://sustainablereview.com/brand-ratings/"))

    # update brand urls list
    brand_urls.extend(brands)

print(brand_urls)

['https://sustainablereview.com/brand-ratings/division/', 'https://sustainablereview.com/brand-ratings/adidas/', 'https://sustainablereview.com/brand-ratings/adarche-clothing/', 'https://sustainablereview.com/brand-ratings/a-dam/', 'https://sustainablereview.com/brand-ratings/aestethic-london/', 'https://sustainablereview.com/brand-ratings/adelaide-c-ecoage/', 'https://sustainablereview.com/brand-ratings/337-brand/', 'https://sustainablereview.com/brand-ratings/a_c/', 'https://sustainablereview.com/brand-ratings/a-roege-hove/', 'https://sustainablereview.com/brand-ratings/adidas-by-stella-mccartney/', 'https://sustainablereview.com/brand-ratings/absolutely-bear/', 'https://sustainablereview.com/brand-ratings/acbc/', 'https://sustainablereview.com/brand-ratings/a-bch/', 'https://sustainablereview.com/brand-ratings/aeance/', 'https://sustainablereview.com/brand-ratings/afends/', 'https://sustainablereview.com/brand-ratings/a-happy-brand/', 'https://sustainablereview.com/brand-ratings/les

In [None]:
# create dataframe
sustainable_review_df = pd.DataFrame(columns=['Brand', 'Rating', 'Factors'])

In [None]:
# scrape for brand data
for url in brand_urls:
    try: 
        # search for brand review
        review = session.get(url)
        
        # content
        content = BeautifulSoup(review.text, 'html.parser')

        # brand
        brand = content.find('h1', class_='post-title').get_text()

        # rating
        information = content.find('div', class_='InfoBox')
        rating = information.find('p').get_text().split(" ")[3]
        
        # factors: text of each <h3> heading in the review body
        body = content.find('div', class_='col-md-12 col-lg-9')
        factors = [heading.get_text(strip=True) for heading in body.find_all('h3')]

        # clean list of factors
        cleaned_factors = []
        for factor in factors:
            cleaned_factor = factor.replace("Similar brands:", "").strip()

            # drop trailing ':' from factor
            if cleaned_factor.endswith(":"):
                cleaned_factor = cleaned_factor[:-1]

            # skip empty headings and headings containing "Conclusion"
            if cleaned_factor and "Conclusion" not in cleaned_factor:
                cleaned_factors.append(cleaned_factor)

        # update dataframe
        sustainable_review_df.loc[len(sustainable_review_df.index)] = [brand, rating, ", ".join(cleaned_factors)]
    
    # catch scraping errors only, so Ctrl+C can still stop the loop
    except Exception:
        print(url + " Failed to Load")

https://sustainablereview.com/brand-ratings/the-social-studio/ Failed to Load
https://sustainablereview.com/brand-ratings/the-white-ribbon/ Failed to Load


In [None]:
# visualize first 10 rows of the new dataset
sustainable_review_df.head(10)

| | Brand | Rating | Factors |
|---|---|---|---|
| 0 | Modibodi | 3 | |
| 1 | Natalie Perry | 4 | |
| 2 | (di)vision | 4 | |
| 3 | Adidas | 3 | Adidas’ Global Recognition for Sustainability ... |
| 4 | Adarche Clothing | 4 | Protecting the Planet with Adarche Clothing, E... |
| 5 | A-dam | 4 | |
| 6 | Aestethic London | 5 | |
| 7 | Adelaide C. Ecoage | 4 | |
| 8 | 337 BRAND | 4 | |
| 9 | A_C | 4 | |


In [None]:
# export csv file
sustainable_review_df.to_csv('../data/sustainable_review_ratings.csv')

### Qualitative (pt. 2):
#### Answer/Update to Question/Claim
- Sustainable Review determines overall brand sustainability by synthesizing research on the brand's sustainability efforts. It rates brands on a scale of 1-5.
- We successfully extracted data from Sustainable Review's website to create a dataset of sustainability ratings for brands

#### Summary & Re-contextualization
- Synthesizing the main factors which went into the sustainability score by isolating the headlines in the brand description helped explain the score
- The factors we consider for our own sustainability score will be limited to the information available about a product online, but this often includes the materials used to create it

#### Uncertainty, Limitations & Caveats
- The quality and presence of headlines in the brand description varied greatly
- The large number of brands evaluated by Sustainable Review made web scraping slow, and two links to brand reviews could not be accessed
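
To put a number on how often headlines are missing, one option is to measure the share of brands with a non-empty `Factors` value. This is a minimal sketch: `factor_coverage` is a hypothetical helper, the rows are assumed to come from `csv.DictReader` on the exported CSV (where missing values are empty strings), and the in-memory sample with its placeholder heading stands in for the real file.

```python
import csv
import io

def factor_coverage(rows, column='Factors'):
    """Return the fraction of rows with a non-empty value in `column`."""
    rows = list(rows)
    if not rows:
        return 0.0
    filled = sum(1 for row in rows if (row.get(column) or '').strip())
    return filled / len(rows)

# stand-in for the exported CSV; 'example heading' is a placeholder, not scraped text
sample_csv = io.StringIO(
    ",Brand,Rating,Factors\n"
    "0,Modibodi,3,\n"
    "1,Natalie Perry,4,\n"
    "3,Adidas,3,example heading\n"
)
coverage = factor_coverage(csv.DictReader(sample_csv))
print(f"{coverage:.0%} of sampled brands have at least one factor")
```

Running the same function on the full CSV would show whether the headlines are common enough to be useful as a feature.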

#### New Problems & Next Steps
- Can we ensure that all links to brand reviews can be accessed by breaking down the scraping process?
- Are qualitative or quantitative ratings more effective for communicating the sustainability score of a fashion brand?
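
One way the scraping process might be broken down is to fetch URLs in batches and re-attempt only the ones that failed. The sketch below is an assumption about how that could look, not the method used above: `scrape_in_batches` and `flaky_fetch` are hypothetical names, and the stand-in fetch function simulates failures instead of making real requests.

```python
import time

def scrape_in_batches(urls, fetch, batch_size=50, pause=1.0, max_passes=3):
    """Fetch `urls` in batches and re-attempt failures on later passes.

    `fetch` is any callable that takes a URL and returns data or raises.
    Returns (results, pending): a dict of url -> data and the URLs still failing.
    """
    results = {}
    pending = list(urls)
    for _ in range(max_passes):
        if not pending:
            break
        failed = []
        for start in range(0, len(pending), batch_size):
            for url in pending[start:start + batch_size]:
                try:
                    results[url] = fetch(url)
                except Exception:
                    failed.append(url)
            # short pause between batches to avoid overloading the site
            time.sleep(pause)
        pending = failed
    return results, pending

# stand-in for a real page request: "flaky" URLs fail once, "broken" URLs always fail
attempts = {}
def flaky_fetch(url):
    attempts[url] = attempts.get(url, 0) + 1
    if 'flaky' in url and attempts[url] == 1:
        raise ConnectionError(url)
    if 'broken' in url:
        raise ConnectionError(url)
    return url.upper()

results, still_failed = scrape_in_batches(
    ['https://example.com/a', 'https://example.com/flaky', 'https://example.com/broken'],
    flaky_fetch, batch_size=2, pause=0, max_passes=3,
)
print(sorted(results), still_failed)
```

Keeping the list of URLs that still fail makes it easy to inspect them by hand or rerun them later without repeating the full scrape.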

#### Story & Domain Knowledge
- Scraping data from Sustainable Review was more complex than scraping data from Eco-Stylist because the brands the organization had rated were not all available on one page
- The factors which went into the sustainability score line up with general domain knowledge about what makes a brand sustainable