# FashionFootprint QQQ Report

## **Q9**  How can we make data collection more efficient? *(Salley)*

### Qualitative:
#### Problem - 
- In last week's YouTube data pull, I had to rewrite code for each brand (long, tedious, more difficult to read)
- I also was limited to the first 50 results
#### Hypothesis & Claim - 
- How can I incorporate the pagination methods we learned in Week 1/2's lab in pulling data?
- I should be able to adjust my code to follow a similar pagination method to the lab and get more results.
#### Context, Motivation & Rationale - 
- We want larger datasets
- We want more efficient and more readable code
- Many of the videos in the datasets we pulled last week did not include links we could scrape from, so pulling more videos should give us more usable links
#### Assumptions & Biases - 
- Assuming all data gathered is accurate
- Biases towards certain keywords that could limit the types of fashion videos we pull
#### Definitions, Data, and Methods - 
- Data from YouTube API
- From `youtube-p1.ipynb` - *Your Final Task for Part 1 - Make this Reusable* code
- **Pagination** - retrieving a large number of search results from the YouTube API using multiple requests rather than getting them all in a single request
   - More efficient to handle larger datasets in smaller chunks
   - Ensures retrieval of all the relevant search results for each brand by making multiple requests to the YouTube API
   - Each request fetches a new page of results until all results have been retrieved
- **Methods:**
   - Each response includes a `nextPageToken`, which is passed to the next request to get the next page of results
   - To retrieve all results for a query, keep making requests with the `nextPageToken` from the previous response until no token is returned
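
The token-passing loop described above can be sketched independently of the YouTube client. This is a minimal sketch, not the lab's code: `fetch_page` is a hypothetical callable standing in for one API request, and it is assumed to return a dict with an `items` list and an optional `nextPageToken`, as the search response does. The fake pages are made-up data for illustration.

```python
def collect_all_pages(fetch_page, max_pages=None):
    """Follow nextPageToken until the API stops returning one.

    fetch_page is a hypothetical callable: it takes a page token (None for
    the first page) and returns a dict shaped like a YouTube search response.
    """
    items = []
    token = None
    pages = 0
    while True:
        response = fetch_page(token)
        items.extend(response.get('items', []))
        pages += 1
        token = response.get('nextPageToken')
        # stop when there is no next page, or at the optional page cap
        if token is None or (max_pages is not None and pages >= max_pages):
            break
    return items


# Stand-in for the API: three fake pages keyed by page token.
FAKE_PAGES = {
    None: {'items': ['a', 'b'], 'nextPageToken': 't1'},
    't1': {'items': ['c', 'd'], 'nextPageToken': 't2'},
    't2': {'items': ['e']},
}

all_items = collect_all_pages(lambda token: FAKE_PAGES[token])
first_two_pages = collect_all_pages(lambda token: FAKE_PAGES[token], max_pages=2)
```

The optional `max_pages` cap is one way to keep a single brand from using up the whole quota.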

### Quantitative:

In [2]:
from googleapiclient.discovery import build
import pandas as pd
from datetime import datetime
import re
import os
import config

# set up YouTube Data API 
api_key = config.API_KEY
youtube = build('youtube', 'v3', developerKey=api_key)

In [3]:
def get_youtube_data(brand_name):
    published_after = datetime(2023, 9, 1).isoformat() + 'Z'
    published_before = datetime(2024, 4, 24).isoformat() + 'Z'
    keywords = ['haul', 'clothing', 'clothes', 'shop', 'shopping', 'try on', 'try-on', 'review', 'styling']
    social_media_links = ['pinterest', 'youtube', 'twitter', 'instagram', 'tiktok',
                          'reddit', 'twitch', 'facebook', 'thmatc', 'spotify']

    # fetch initial search results
    search_results = []
    request = youtube.search().list(
        q=brand_name,
        part='snippet',
        type='video',
        publishedAfter=published_after,
        publishedBefore=published_before,
        maxResults=10
    )
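    response = request.execute()
    # keep the first page's items; the loop below only adds later pages
    search_results.extend(response.get('items', []))
    next_page_token = response.get('nextPageToken')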
    
    while next_page_token is not None:
        # send request to YouTube API
        request = youtube.search().list(
            q=brand_name,
            part='snippet',
            type='video',
            publishedAfter=published_after,
            publishedBefore=published_before,
            maxResults=5,
            pageToken=next_page_token
        )
        response = request.execute()
        # add items from response 
        search_results.extend(response.get('items',[]))
        # get next page token for pagination
        next_page_token = response.get('nextPageToken')

    # process search results to extract relevant vid data
    brand_videos = []
    for search_result in search_results:
        # gets and stores video id
        video_id = search_result['id']['videoId']
        video_response = youtube.videos().list(
            # receive snippet part of data - title, description, tags, etc.
            part="snippet",
            id=video_id
        ).execute()

        # access description field of snippet
        description = video_response['items'][0]['snippet']['description']
        # extract links from description
        links = re.findall(r'(https?://\S+)', description)
        # makes all titles lowercase so code can match on any version of title:
        title = search_result['snippet']['title'].lower()

        # filters based on brand name and fashion related keywords
        if brand_name.lower() in title and any(keyword in title for keyword in keywords):
            # filters out social media links
            filtered_links = [link for link in links if not any(keyword in link for keyword in social_media_links)]
            # get link to video
            video_link = f"https://www.youtube.com/watch?v={video_id}"
            brand_videos.append({
                'title': search_result['snippet']['title'],
                'links': filtered_links,
                'videoLink': video_link
            })

    # process data and format for csv
    brand_youtube_data = []
    for video in brand_videos:
        if video['links']:
            brand_youtube_data.append({
                'Title': video['title'],
                'Links': '\n'.join(video['links']),
                'VideoLink': video['videoLink']
            })

    return brand_youtube_data
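
The function above makes one `videos().list` request per search result, which may contribute to the long run time. As far as I know, `videos().list` accepts a comma-separated `id` string of up to 50 IDs, so descriptions could be fetched in batches instead. A sketch under that assumption: `chunked` and `fetch_descriptions` are hypothetical helpers, and `youtube` is the client object built in the first cell.

```python
def chunked(values, size):
    """Yield consecutive slices of `values` with at most `size` elements."""
    for start in range(0, len(values), size):
        yield values[start:start + size]


def fetch_descriptions(youtube, video_ids, batch_size=50):
    """Return {video_id: description}, using one videos().list call per batch."""
    descriptions = {}
    for batch in chunked(video_ids, batch_size):
        response = youtube.videos().list(
            part='snippet',
            id=','.join(batch)  # comma-separated IDs in a single request
        ).execute()
        for item in response.get('items', []):
            descriptions[item['id']] = item['snippet']['description']
    return descriptions
```

Videos missing from the response (deleted or private, for example) are simply absent from the returned dict, so callers should look them up with `.get(video_id)`.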

In [4]:
youtube_data = get_youtube_data('Uniqlo')

for video in youtube_data:
    print(f"Title: {video['Title']}")
    print(f"Links: {video['Links']}")
    print(f"Video Link: {video['VideoLink']}")
    print()
print()

HttpError: <HttpError 403 when requesting https://youtube.googleapis.com/youtube/v3/search?q=Uniqlo&part=snippet&type=video&publishedAfter=2023-09-01T00%3A00%3A00Z&publishedBefore=2024-04-24T00%3A00%3A00Z&maxResults=10&key=REDACTED&alt=json returned "The request cannot be completed because you have exceeded your <a href="/youtube/v3/getting-started#quota">quota</a>.". Details: "[{'message': 'The request cannot be completed because you have exceeded your <a href="/youtube/v3/getting-started#quota">quota</a>.', 'domain': 'youtube.quota', 'reason': 'quotaExceeded'}]">

In [5]:
# reads uniqlo CSV into a pandas df
uniqlo_df = pd.read_csv("../data/youtube_data/uniqlo_youtube_data.csv")

# display the first 5 rows
print("First 5 rows of Uniqlo YouTube data:")
print(uniqlo_df.head())

First 5 rows of Uniqlo YouTube data:
                                               Title  \
0  vlog | tips on layering jewellery, Uniqlo haul...   
1  HUGE Uniqlo Try-On Haul ! Should I Keep or Ret...   
2  UNIQLO Pants| Are UNIQLO Pants Quality? Practi...   
3                      Uniqlo: C Try-On &amp; Review   
4  UNIQLO WINTER CAPSULE TRY ON HAUL! MY FAVE UNI...   

                                               Links  \
0  https://to.pandora.net/iamcharlotteolivia-cher...   
1  https://www.uniqlo.com/us/en/products/E450606-...   
2  https://www.uniqlo.com/us/en/products/E464744-...   
3  https://shopmy.us/cecifunnce\nhttps://go.shopm...   
4  http://loveindiamoon.co.uk\nhttps://www.vinted...   

                                     VideoLink  
0  https://www.youtube.com/watch?v=dXUSSkPrvEw  
1  https://www.youtube.com/watch?v=DS86919P3Vg  
2  https://www.youtube.com/watch?v=gNLCIBOomjM  
3  https://www.youtube.com/watch?v=aOXPQlC3aw8  
4  https://www.youtube.com/watch?v=jChlH6uDr

### Qualitative (pt. 2):
#### Answer/Update to Question/Claim
- Was able to use pagination for my code!
- Did not speed up data collection (for a list of 3-4 brands, it takes ~3 minutes)
- Made the code file much shorter since I could re-use code
- Made datasets much longer
#### Summary & Re-contextualization
- I was able to use similar methods as the YouTube lab to make my data pull processes from Week 4 reusable
- The code still runs for a long time and I have to use small lists of brands, but it makes datasets much longer
#### Story & Domain Knowledge
- Learned about limitations of YouTube API and API keys in general
- Learned about pagination and how it can increase efficiency and amount of data gathered
#### Uncertainty, Limitations & Caveats
- Reach quota limits very quickly (the Uniqlo run above failed with a 403 `quotaExceeded` error)
- I can pull at most 4 brands per API key
- Takes around 3 minutes to run each time
- Still tedious and time consuming, but able to collect a lot more data
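
A rough quota estimate may help explain the limits above. The figures here are assumptions from my reading of the YouTube Data API quota documentation and should be checked against the current docs: 100 units per `search().list` call, 1 unit per `videos().list` call, and a default allocation of 10,000 units per day. `estimate_quota_units` is a hypothetical helper, and the count of 200 results is a made-up example, not a measured value.

```python
import math

# Assumed quota costs; verify against the current YouTube Data API docs.
SEARCH_LIST_COST = 100
VIDEOS_LIST_COST = 1
DAILY_QUOTA = 10_000


def estimate_quota_units(total_results, results_per_page, videos_per_call=1):
    """Estimate units to page through `total_results` and fetch each video's details."""
    search_calls = math.ceil(total_results / results_per_page)
    video_calls = math.ceil(total_results / videos_per_call)
    return search_calls * SEARCH_LIST_COST + video_calls * VIDEOS_LIST_COST


# Hypothetical brand with 200 results: 5 per page vs. 50 per page
small_pages = estimate_quota_units(200, results_per_page=5)
large_pages = estimate_quota_units(200, results_per_page=50)
print(small_pages, large_pages)
```

If these costs are right, the page size (`maxResults`) dominates the total, so larger pages would let one API key cover more brands.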
#### New Problems & Next Steps
- How do we display this data in our Chrome Extension?
- How can we combine this with web scraping? Can we combine it?
- How do we combine datasets to access the correct item and its corresponding scores?

## **Q10**  How do we connect outputting scores & other information with YouTube data? *(Salley)*

### Qualitative:
#### Problem - 
- How do we connect different aspects of what our team is doing to create the Chrome extension?
- How can we combine our CSV files and scores?
#### Hypothesis & Claim - 
- We should be able to join the CSVs by matching on brand, title, and link
- Based on a link that we read, we can display information about clothing items in the video's description
#### Context, Motivation & Rationale - 
- Because we have been able to read links using web scraping, I am optimistic that a Chrome Extension can read the link of the YouTube video a person is watching and match it with a link and/or title from our dataset
- I want a shared key that lets us both join the CSVs and connect user input to items in our dataset
#### Assumptions & Biases - 
- Assuming YouTube links that we pull from YouTube API link to real videos
- I may have bias towards certain keywords and brands that makes me prioritize them during the testing and building process
#### Definitions, Data, and Methods - 
- Using Google Sheets to combine our CSVs together
   - Used `brand_info.csv` to pull people, planet, and brand score
   - Used *selflessclothes.com* to pull material score
- Converted combined CSV to JSON using ChatGPT
- Created basic web page using JS & HTML that took in a link, matched link with link in dataset, and outputted item name and scores
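
The manual Google Sheets join and the ChatGPT conversion could probably be scripted. A minimal sketch using only the standard library: the column names (`Brand`, `overall_score`, and so on) and the sample rows are hypothetical placeholders, not the real contents of `brand_info.csv`.

```python
import csv
import io
import json

# Hypothetical sample data standing in for the real CSV files.
VIDEOS_CSV = """Brand,Title,VideoLink
Uniqlo,Example haul,https://www.youtube.com/watch?v=EXAMPLE1
OtherBrand,Example review,https://www.youtube.com/watch?v=EXAMPLE2
"""
BRAND_INFO_CSV = """Brand,overall_score,people_score,planet_score
Uniqlo,3,3,3
"""


def join_on_brand(video_rows, brand_rows):
    """Attach brand scores to each video row; brands are matched case-insensitively."""
    scores_by_brand = {row['Brand'].strip().lower(): row for row in brand_rows}
    joined = []
    for video in video_rows:
        scores = scores_by_brand.get(video['Brand'].strip().lower())
        if scores is None:
            continue  # skip videos whose brand has no score row
        merged = dict(video)
        merged.update({k: v for k, v in scores.items() if k != 'Brand'})
        joined.append(merged)
    return joined


video_rows = list(csv.DictReader(io.StringIO(VIDEOS_CSV)))
brand_rows = list(csv.DictReader(io.StringIO(BRAND_INFO_CSV)))
combined = join_on_brand(video_rows, brand_rows)
combined_json = json.dumps(combined, indent=2)
```

Note that `csv.DictReader` returns every value as a string, so scores would need converting to numbers before any arithmetic.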

### Quantitative:

**`index.html`**

In [1]:
# <!DOCTYPE html>
# <html lang="en">
# <head>
#     <meta charset="UTF-8">
#     <meta name="viewport" content="width=device-width, initial-scale=1.0">
#     <title>Sustainability Score Finder</title>
#     <link rel="stylesheet" href="style.css">
# </head>
# <body>
#     <div class="video-link-box">
#         <h1>Sustainability Score Finder</h1>
#         <input type="text" id="videoLinkInput" placeholder="Enter video link here">
#         <button onclick="findScores()" id="findButton">Find Scores!</button>
#         <div id="result"></div>
#     </div>
#     <script src="basic-webpage.js"></script>
# </body>
# </html>

**`basic-webpage.js`**

In [2]:
# async function findScores() {
#     // retrieves inputted vid link from input field
#     var videoLink = document.getElementById("videoLinkInput").value;

#     try {
#         // get data from brand json file
#         var response = await fetch('uniqlo-data.json');
#         var brandDataset = await response.json();
        
#         // retrieves reference to result element (from html)
#         var resultDiv = document.getElementById("result");
#         // if link is found...
#         var found = false;

#         // basic table to display results
#         var itemScoreTable = `
#             <table>
#                 <tr>
#                     <th>Item</th>
#                     <th>Material Score</th>
#                     <th>Overall Score</th>
#                     <th>People Score</th>
#                     <th>Planet Score</th>
#                 </tr>
#         `;

#         for (var i=0; i < brandDataset.length; i++) {
#             // finds matching video link in brand dataset
#             // find more efficient way of doing this later...
#             if (brandDataset[i].VideoLink === videoLink) {
#                 found = true;
#                 // retrieves links to items in the vid's description
#                 var items = brandDataset[i].Links;
#                 // retrieves item name and scores
#                 for (var j=0; j < items.length; j++) {
#                     var item = items[j].ScrapedData.item;
#                     var materialScore = items[j].MaterialScore;
#                     var overallScore = items[j].overall_score;
#                     var peopleScore = items[j].people_score;
#                     var planetScore = items[j].planet_score;
#                     // adds each item to table 
#                     // MAKE LOOK BETTER
#                     itemScoreTable += `
#                         <tr>
#                             <td>${item}</td>
#                             <td>${materialScore}</td>
#                             <td>${overallScore}</td>
#                             <td>${peopleScore}</td>
#                             <td>${planetScore}</td>
#                         </tr>
#                     `;
#                 }
#             }
#         }
#         itemScoreTable += `</table>`;

#         if (found) {
#             resultDiv.innerHTML = itemScoreTable;
#         } else {
#             resultDiv.innerHTML = "Video link not found in the brandDataset.";
#         }
#     } catch (error) {
#         console.error('Error fetching or parsing data:', error);
#     }
# }

**`uniqlo-data.json`**

In [3]:
# {
#       "Title": "Uniqlo U Spring/Summer 2023 Styling Haul + Monthly Favourites",
#       "ShortTitle": "Uniqlo U Spring/Summer 2023 Styling Haul",
#       "VideoLink": "https://www.youtube.com/watch?v=9hktZEc3Vhs",
#       "Links": [
#         {
#           "URL": "https://rstyle.me/+CZM_fbUHWdR5PT9pvIqK8Q",
#           "ScrapedData": {
#             "item": "U Oversized Single Breasted Coat",
#             "cotton": 35,
#             "nylon": 30,
#             "polyester": 65
#           },
#           "CleanScrapedData": {
#             "item": "U Oversized Single Breasted Coat",
#             "cotton": 70,
#             "nylon": 30
#           },
#           "MaterialScore": 0.3,
#           "overall_score": 3,
#           "people_score": 3,
#           "planet_score": 3
#         },
#         {
#           "URL": "https://rstyle.me/+MAdavWCUzekigB04gtAKaw",
#           "ScrapedData": {
#             "item": "U Short Jacket",
#             "cotton": 100,
#             "polyester": 65
#           },
#           "CleanScrapedData": {
#             "item": "U Short Jacket",
#             "cotton": 100
#           },
#           "MaterialScore": 0.9,
#           "overall_score": 3,
#           "people_score": 3,
#           "planet_score": 3
#         }
#       ]
#     }

### Qualitative (pt. 2):
#### Answer/Update to Question/Claim
- We can use **brand name** and **YouTube link** to connect scores to each other and videos to items in our dataset
- Our Chrome Extension could potentially read the link of the video a user is currently on and match it with a link from our datasets to output information about it
- We did low-level, manual combining of datasets by looking for matching brands across the CSV files and putting them together in Google Sheets
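
Matching on the exact link string may miss the same video when the URL carries extra parameters (a timestamp, a playlist) or uses the short `youtu.be` form. One option is to compare video IDs instead. A sketch, shown in Python to match the notebook (the extension itself would need the same logic in JS): `extract_video_id` is a hypothetical helper and only handles the two URL shapes shown.

```python
from urllib.parse import urlparse, parse_qs


def extract_video_id(url):
    """Return the YouTube video ID from a watch or youtu.be URL, else None."""
    parsed = urlparse(url.strip())
    host = parsed.netloc.lower()
    if host.startswith('www.'):
        host = host[4:]
    if host in ('youtube.com', 'm.youtube.com') and parsed.path == '/watch':
        ids = parse_qs(parsed.query).get('v')
        return ids[0] if ids else None
    if host == 'youtu.be':
        return parsed.path.lstrip('/') or None
    return None


dataset_link = 'https://www.youtube.com/watch?v=9hktZEc3Vhs'
user_link = 'https://www.youtube.com/watch?v=9hktZEc3Vhs&t=42s'
same_video = extract_video_id(dataset_link) == extract_video_id(user_link)
```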
#### Summary & Re-contextualization
- We were able to combine information across multiple datasets (Salley's YouTube video pull, Jasmine's brand-specific web scraping, Sabrina & Megan's brand/clothing item score web scraping)
#### Story & Domain Knowledge
- Learned how to read in JSON files in JS
- Learned about how to look for an item in a dataset based on user input and display other information within that row
#### Uncertainty, Limitations & Caveats
- Is this possible on a larger scale? We are currently working with a very small JSON file and matching with nested for loops, so it is unclear how well this will work on a much larger dataset
- The basic webpage we made is not a Chrome Extension - will it apply to Chrome Extension-building?
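
On the scale question: a linear scan checks every record on each lookup, while an index keyed by video link makes each lookup roughly constant time. A sketch in Python (the same idea applies to a JS `Map` or plain object): `build_index` and `find_scores` are hypothetical helpers, the record is trimmed from the shape of `uniqlo-data.json`, and the records are assumed to be stored as a list.

```python
def build_index(records):
    """Map each record's VideoLink to the record, built once up front."""
    return {record['VideoLink']: record for record in records}


def find_scores(index, video_link):
    """Return (item, MaterialScore, overall_score) rows for a video, or None."""
    record = index.get(video_link)
    if record is None:
        return None
    return [
        (link['ScrapedData']['item'], link['MaterialScore'], link['overall_score'])
        for link in record['Links']
    ]


# Trimmed record in the shape of uniqlo-data.json
records = [{
    'VideoLink': 'https://www.youtube.com/watch?v=9hktZEc3Vhs',
    'Links': [
        {'ScrapedData': {'item': 'U Short Jacket', 'cotton': 100},
         'MaterialScore': 0.9, 'overall_score': 3},
    ],
}]

index = build_index(records)
rows = find_scores(index, 'https://www.youtube.com/watch?v=9hktZEc3Vhs')
missing = find_scores(index, 'https://www.youtube.com/watch?v=unknown')
```

Building the index costs one pass over the dataset, after which the per-lookup cost no longer grows with the number of videos.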
#### New Problems & Next Steps
- How do we want to display scores? In what format (design-focused)?
- Can we display more information? Can we display brand recommendations?
- How do we convert the code we wrote for the web page to a Chrome Extension?