# FashionFootprint QQQ Report

## **Q1**  How can we find fashion-related YouTube videos? *(Salley)*

### Qualitative:
#### Problem - 
- How can we use the YouTube API to search through YouTube videos to find fashion-related ones we can use for web scraping?
- Will we be able to get enough data from YouTube?  
#### Hypothesis & Claim - 
- We should be able to search through fashion data by matching titles to a set of fashion-related keywords
- Then extract links from the video descriptions  
#### Context, Motivation & Rationale - 
- We want the Chrome Extension to work on YouTube, so seeing if we can access links and scrape them to provide feedback is an important first step.  
- Having a dataset of existing YouTube videos may also help us test our tool on a smaller scale while in the development stages
#### Rationale, Assumptions, Biases - 
- Assuming all YouTube data gathered is accurate and reliable
- My rationale in selecting my keywords is my own knowledge of how fashion-related YouTube videos are titled
- I may be biased towards certain keywords/titles due to my watching certain types of fashion content
#### Definitions, Data, and Methods - 
- Using the YouTube Data API and YouTube video data  
- Using similar methods to our YouTube lab from Weeks 1 & 2 (getting video data via the snippet part)

### Quantitative:

In [4]:
from googleapiclient.discovery import build
import pandas as pd
from datetime import datetime
import re

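# set up YouTube Data API
# (the key is read from an environment variable so it is not stored in the notebook;
#  the variable name YOUTUBE_API_KEY is an assumption)
import os
api_key = os.environ["YOUTUBE_API_KEY"]
youtube = build('youtube', 'v3', developerKey=api_key)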

In [5]:
# search by keywords
keywords = ['haul', 'clothing', 'clothes', 'shop', 'shopping', 'try on', 'try-on']

# set time period to be in 2023 (the upper bound is the start of 2024 so Dec 31 is included)
published_after = datetime(2023, 1, 1).isoformat() + 'Z'
published_before = datetime(2024, 1, 1).isoformat() + 'Z'

In [6]:
# search for videos
search_response = youtube.search().list(
    q='|'.join(keywords),  # q must be a string; '|' acts as OR between search terms
    part='snippet',
    type='video',
    publishedAfter=published_after,
    publishedBefore=published_before,
    maxResults=1  # returns 1 result (video) that matches the criteria
).execute()

In [7]:
# process results
videos = []
for search_result in search_response.get('items', []):
    # get and store video id
    video_id = search_result['id']['videoId']
    video_response = youtube.videos().list(
        # receive snippet part of data - title, description, tags, etc.
        part="snippet",
        id=video_id
    ).execute()

    # access the description field of the snippet
    description = video_response['items'][0]['snippet']['description']
    
    # extract links from description
    links = re.findall(r'(https?://\S+)', description)
    
    # add video title and links (an empty list if the description has none) to videos list
    videos.append({
        'title': search_result['snippet']['title'],
        'links': links
    })

In [8]:
# filter vids into separate lists based on keywords
filtered_vids_keywords = {}

for keyword in keywords:
    filtered_vids_keywords[keyword] = [
        video for video in videos 
        if keyword.lower() in video['title'].lower() and video['links']
    ]
for keyword, vid_list in filtered_vids_keywords.items():
    print(f"VIDEOS WITH '{keyword}':")
    for video in vid_list:
        print("title:", video['title'])
        print("links:")
        for link in video['links']:
            print(link)
        print()

VIDEOS WITH 'haul':
title: SUMMER SHOP WITH ME!!🛒🎀 get out of a fashion rut, collective clothing haul, outfit inspiration
links:
https://www.aritzia.com/us/en/product/the-effortless-pant%E2%84%A2/96000.html?dwvar_96000_color=23914
https://www.aritzia.com/us/en/product/contour-mockneck-tank/83839.html?dwvar_83839_color=30252&dwvar_83839_size=3
https://www.aritzia.com/us/en/product/the-effortless-short%E2%84%A2-lo-rise-3%22/109952.html?dwvar_109952_color=11420
https://www.zara.com/us/en/full-length-trf-high-rise-wide-leg-jeans-p06045025.html?v1=277681186
https://us.princesspolly.com/products/city-of-angels-pant-spanish-grey?currency=USD&variant=39691555340372&utm_medium=cpc&utm_source=google&utm_campaign=Google%20Shopping&utm_source=cpc&utm_medium=google&utm_term=&adid=&matchtype=&addisttype=xpla&tw_source=google&tw_adid=&tw_campaign=19750607918&gclid=CjwKCAjwkeqkBhAnEiwA5U-uM3GDG8GiIhgeP9xrScVE8307JC-uystVIejBh10tUP3vNTA-8ekLDhoCE5cQAvD_BwE
https://www.birkenstock.com/us/boston-suede-le

### Qualitative (pt. 2):
#### Answer/Update to Question/Claim
- How can we use the YouTube API to search through YouTube videos to find fashion-related ones we can use for web scraping?
   - Search for videos through using a list of keywords, then access their descriptions to extract links
- Will we be able to get enough data from YouTube?
   - There is a default quota limit of 10,000 units per day. A search request costs 100 units, so we can run at most about 100 searches per day
- We should be able to search through fashion data by matching titles to a set of fashion-related keywords
   - Yes!
- Then extract links from the video descriptions
   - Yes!
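
To put the quota in concrete terms, the sketch below estimates how many videos one day's quota covers when following the notebook's pattern of one search call plus one details call per video. It assumes the per-call costs listed in the YouTube Data API quota documentation (100 units for `search.list`, 1 unit for `videos.list`); the function name `videos_per_day` is hypothetical, and the costs should be re-checked before relying on the numbers.

```python
# Assumed per-call costs (verify against the current quota documentation).
SEARCH_LIST_COST = 100   # units per search.list request
VIDEOS_LIST_COST = 1     # units per videos.list request
DAILY_QUOTA = 10_000     # default daily quota


def videos_per_day(results_per_search: int, daily_quota: int = DAILY_QUOTA) -> int:
    """Estimate how many videos can be fully processed in one day.

    Mirrors the notebook's pattern: one search.list call, then one
    videos.list call for each returned video.
    """
    cost_per_search = SEARCH_LIST_COST + results_per_search * VIDEOS_LIST_COST
    searches = daily_quota // cost_per_search
    return searches * results_per_search


print(videos_per_day(1))   # 99
print(videos_per_day(50))  # 3300
```

Under these assumptions, asking for 50 results per search makes far better use of the quota than asking for 1, because the search call dominates the cost.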
#### Summary & Re-contextualization
- We are able to extract relevant videos by searching for keywords in their title that relate to fashion
- We are able to extract links from their descriptions
#### Story & Domain Knowledge
- Will apply knowledge gained in initial data pull from YouTube to future data pulls!
- Learned about YouTube API - how to gather specific types of data and organize/format it
#### Uncertainty, Limitations & Caveats
- A single search request returns at most 50 results; getting more requires paging through the results, and each page costs additional quota
- Some of the extracted links are not relevant (e.g., social media links)
- Results are not grouped by the most relevant keywords (e.g., clothing and try on)
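
The 50-result cap applies to each request rather than to the search as a whole: a response can carry a `nextPageToken`, which is passed back as `pageToken` to fetch the next page. A minimal sketch of that loop follows, assuming `youtube` is the client built earlier. The helper name `search_all_pages` and the `max_pages` cap are hypothetical, and every page still counts as a full search request against the quota.

```python
def search_all_pages(youtube, max_pages=3, **search_params):
    """Collect search results across several pages.

    `youtube` is assumed to be a googleapiclient YouTube client;
    `search_params` (q, publishedAfter, ...) are forwarded to search().list().
    """
    items = []
    page_token = None
    for _ in range(max_pages):
        params = dict(search_params, part='snippet', type='video', maxResults=50)
        if page_token:
            params['pageToken'] = page_token
        response = youtube.search().list(**params).execute()
        items.extend(response.get('items', []))
        page_token = response.get('nextPageToken')
        if not page_token:
            break  # no further pages
    return items
```

It could be called as `search_all_pages(youtube, q='haul', publishedAfter=published_after, publishedBefore=published_before)`, with `max_pages` chosen to fit the remaining quota.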
#### New Problems & Next Steps
- Is there a better way to organize and extract data?
- How can we extract only relevant links?

## **Q2** Is there a better way to organize and extract data? *(Salley)*

### Qualitative:
#### Problem - 
- How can we better organize data?
- How can we get the max amount of data given the quota and results limit?
- How can we get rid of irrelevant links?
#### Hypothesis & Claim - 
- We can perform searches on specific brands initially, rather than keywords
- We can filter on fashion-related keywords after
- We can remove social media links by searching for keywords within the link itself
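
One way to sketch the link filter is to compare each link's hostname against a blocklist instead of searching the whole URL, so that a product page that merely mentions a platform in its path or query string is not dropped. The `SOCIAL_DOMAINS` set and the `is_social_link` helper are illustrative assumptions, not a complete list, and two of the example URLs are made up for the demonstration.

```python
from urllib.parse import urlparse

# Illustrative blocklist; extend it as new platforms show up in descriptions.
SOCIAL_DOMAINS = {
    'pinterest.com', 'youtube.com', 'youtu.be', 'twitter.com',
    'instagram.com', 'tiktok.com', 'reddit.com', 'twitch.tv', 'facebook.com',
}


def is_social_link(link: str) -> bool:
    """Return True if the link's host is a blocked domain or one of its subdomains."""
    host = (urlparse(link).hostname or '').lower()
    return any(host == d or host.endswith('.' + d) for d in SOCIAL_DOMAINS)


# Example links (the first and third are invented for illustration).
links = [
    'https://www.instagram.com/someone',
    'https://youtu.be/UzmokCNkUEs',
    'https://www.zara.com/us/en/jeans.html?utm_source=instagram',
]
product_links = [link for link in links if not is_social_link(link)]
print(product_links)
# ['https://www.zara.com/us/en/jeans.html?utm_source=instagram']
```

Matching on the hostname also catches short domains such as `youtu.be`, which a substring check for `youtube` would miss.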
#### Context, Motivation & Rationale - 
- We can better organize data by brands
- We can extract more data since brands are more specific than keywords
- Our data will look cleaner and more organized if we remove social media links
#### Rationale, Assumptions, Biases - 
- Assuming all YouTube data gathered is accurate and reliable, and that it is correctly filtered by the keywords I selected
- My rationale in selecting my keywords is my own knowledge of fashion, social media, etc.
- I may be biased towards selecting certain brands in my initial search due to personal opinions on them
- I may be biased towards filtering out certain social media links due to my personal knowledge of them
#### Definitions, Data, and Methods - 
- Similar methods as before with more filtering

### Quantitative:

In [9]:
social_media_links = ['pinterest', 'youtube', 'twitter', 'instagram', 'tiktok', 'reddit', 'twitch', 'facebook', 'thmatc']

In [10]:
# search for videos
am_eagle_search_results = youtube.search().list(
    q='American Eagle', # search by a specific brand rather than set of fashion-related keywords
    part='snippet',
    type='video',
    publishedAfter=published_after,
    publishedBefore=published_before,
    maxResults=50
).execute()

In [11]:
am_eagle_videos = []
for search_result in am_eagle_search_results.get('items', []):
    video_id = search_result['id']['videoId']
    video_response = youtube.videos().list(
        part="snippet",
        id=video_id
    ).execute()

    description = video_response['items'][0]['snippet']['description']
    
    links = re.findall(r'(https?://\S+)', description)

    # makes all titles lowercase so code can match on any version of title:
    # (e.g., American Eagle, american eagle, AMERICAN EAGLE)
    title = search_result['snippet']['title'].lower()

    # filters based on 'american eagle' in title and fashion-related keywords
    if 'american eagle' in title and any(keyword in title for keyword in keywords):
        # filters out social media links
        filtered_links = [link for link in links if not any(keyword in link for keyword in social_media_links)]

        am_eagle_videos.append({
            'title': search_result['snippet']['title'],
            'links': filtered_links
        })

In [12]:
# only output videos whose descriptions contain links we can scrape
am_eagle_youtube_data = []

for video in am_eagle_videos:
    # check if vid has links
    if video['links']:
        # append the video to the filtered list
        am_eagle_youtube_data.append({
            'Title': video['title'],
            'Links': '\n'.join(video['links'])
        })

# formatted as a list of dicts so it can be loaded into a pandas DataFrame and saved with to_csv
print(am_eagle_youtube_data)

[{'Title': 'American Eagle Denim Haul | Try On | BRUTALLY Honest Review', 'Links': 'https://shopltk.com/explore/Stephanie_Lauer/collections/11ee0ae3055ebff8abd40242ac110003'}, {'Title': 'Shopping While Curvy: Aerie + American Eagle || matching sets &amp; are the jeans curvy friendly?', 'Links': 'https://www.shoplivinfearless.com/\nhttps://kinkistry.com/collections/wefted-hair-closures/products/kinknesis-wefted-bundle?variant=6883127623744\nhttps://www.amazon.com/shop/livin_fearless\nhttps://youtu.be/UzmokCNkUEs\nhttps://youtu.be/gV9shyqatew\nhttps://rstyle.me/+JqMoH6aLHynGpkNoFEsOYQ\nhttps://rstyle.me/+jEAbFk4qNh_glPy2rZ4rdw\nhttps://rstyle.me/+msFu67UM9cc9OPuOQWZHrA\nhttps://rstyle.me/+IO41CFOlJAAB52Mm1N3pDw\nhttps://rstyle.me/+-JDPdOWnc2z3W17e9JTUVA\nhttps://rstyle.me/+WjFzJQzuaDHfgN3RExjR9A\nhttps://rstyle.me/+Rmjw6q4RE3LOAKek0laGYQ\nhttps://rstyle.me/+24Ir1CuXN1iwf3jvejJeTQ\nhttps://rstyle.me/+uxixSRGpy-HyJdfCui_hAg\nhttps://rstyle.me/+oDUkaCUcuIuGWay_ZCuVEw\nhttps://rstyle.me/+TO2

### Qualitative (pt. 2):
#### Answer/Update to Question/Claim
- How can we better organize data?
   - We are able to find videos and search based on specific brands
   - We can then filter further by searching for fashion-related keywords in the title
- How can we get the max amount of data given the quota and results limit?
   - Searching by brand gets us more data, since we can pull up to 50 fashion-related videos per brand rather than 50 fashion-related videos overall
- How can we get rid of irrelevant links?
   - We can filter out links whose URL contains the name of a social media platform
#### Summary & Re-contextualization
- Organize by brand and separate them
- Convert data into CSVs, organized by brand
- Filter out some irrelevant links from description
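
A minimal sketch of the per-brand CSV step, using the standard-library `csv` module so it runs without extra dependencies; `pandas.DataFrame(rows).to_csv(path, index=False)` would be an equivalent route. The `write_brand_csv` helper and the file name in the comment are assumptions. The sketch also unescapes HTML entities such as `&amp;`, which appear in the titles shown in the output above.

```python
import csv
import html
import io


def write_brand_csv(rows, fileobj):
    """Write rows shaped like am_eagle_youtube_data ({'Title', 'Links'}) as CSV."""
    writer = csv.DictWriter(fileobj, fieldnames=['Title', 'Links'])
    writer.writeheader()
    for row in rows:
        writer.writerow({
            'Title': html.unescape(row['Title']),  # '&amp;' -> '&'
            'Links': row['Links'],
        })


# Example row taken from the output above.
rows = [{
    'Title': 'American Eagle Denim Haul | Try On | BRUTALLY Honest Review',
    'Links': 'https://shopltk.com/explore/Stephanie_Lauer/collections/11ee0ae3055ebff8abd40242ac110003',
}]

# Swap the buffer for open('american_eagle.csv', 'w', newline='') to write a file.
buffer = io.StringIO()
write_brand_csv(rows, buffer)
print(buffer.getvalue())
```

Keeping one file per brand matches the way the searches are run, and the files can be concatenated later if a single table is easier to scrape from.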
#### Story & Domain Knowledge
- Applied knowledge gained in the initial data pull from YouTube to increase the amount of data I can gather
- Learned how to organize and format data so it is cleaner
- More knowledge about filtering
#### Uncertainty, Limitations & Caveats
- Each brand takes a long time to process, which limits the number of brands we can gather data for
- Do we need even more data to train our tool/model?
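
Part of the run time comes from sending one `videos.list` request per search result. The `id` parameter of `videos.list` accepts a comma-separated list of video IDs, so a whole page of search results can be looked up in a single call. The sketch below assumes `youtube` is the client built earlier and assumes a limit of 50 IDs per request; `fetch_descriptions` is a hypothetical helper.

```python
def fetch_descriptions(youtube, video_ids, chunk_size=50):
    """Map video ID -> description using one videos.list call per chunk of IDs.

    chunk_size=50 is an assumed per-request limit; lower it if requests fail.
    """
    descriptions = {}
    for start in range(0, len(video_ids), chunk_size):
        chunk = video_ids[start:start + chunk_size]
        response = youtube.videos().list(
            part='snippet',
            id=','.join(chunk),
        ).execute()
        for item in response.get('items', []):
            descriptions[item['id']] = item['snippet']['description']
    return descriptions
```

With the IDs collected from a search response, for example `[r['id']['videoId'] for r in search_response.get('items', [])]`, this replaces up to 50 separate requests with one.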
#### New Problems & Next Steps
- How can we make the code itself more efficient? Right now, I have to rewrite the same code over and over again for each brand.
- Are there further ways we can organize data to make it easier to scrape links? Should we add more specific columns?
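
One way to avoid rewriting the search-and-filter code for each brand is to wrap it in a function that takes the brand name as a parameter and returns one row per link, with the link's domain in its own column. This is a sketch under the assumptions that `youtube` is the client built earlier and that `keywords` and `social_media_links` are the lists defined above; `collect_brand_rows` and the column names are hypothetical.

```python
import re
from urllib.parse import urlparse


def collect_brand_rows(youtube, brand, keywords, social_media_links, **search_params):
    """Return one dict per (video, link) for videos whose title names the brand."""
    response = youtube.search().list(
        q=brand, part='snippet', type='video', maxResults=50, **search_params
    ).execute()

    rows = []
    for result in response.get('items', []):
        title = result['snippet']['title']
        lowered = title.lower()
        # keep only titles that mention the brand and a fashion-related keyword
        if brand.lower() not in lowered or not any(k in lowered for k in keywords):
            continue
        details = youtube.videos().list(
            part='snippet', id=result['id']['videoId']
        ).execute()
        if not details.get('items'):
            continue  # video details unavailable
        description = details['items'][0]['snippet']['description']
        for link in re.findall(r'https?://\S+', description):
            if any(s in link.lower() for s in social_media_links):
                continue  # skip social media links
            rows.append({
                'Brand': brand,
                'Title': title,
                'Domain': urlparse(link).hostname,
                'Link': link,
            })
    return rows
```

A loop such as `for brand in brands: all_rows.extend(collect_brand_rows(youtube, brand, keywords, social_media_links))` would then cover a list of brands, and the one-row-per-link layout leaves each link ready to pass to a scraper.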