## **Q2** How can comprehensive brand ratings be collected from Sustainable Review? *(Sabrina)*
### Qualitative:
#### Problem -
- Prior web scraping methods for collecting brand ratings from Sustainable Review took ~30 minutes
- Despite attempting up to 3 times, the code always failed to access a few links
- Only collecting the headers in brand descriptions did not provide enough information about how the brand rating was determined
- How can the web scraping process be broken down to speed up the collection process and ensure all links are accessed?
#### Hypothesis & Claim -
- We can break down the web scraping process by writing out a function which specifically scrapes brand review pages by Sustainable Review 
- Increasing the number of attempts will increase the chances that all links are accessed.
#### Context, Motivation & Rationale -
- Shortening the time it takes to scrape the website is important because the code must be rerun every time the website updates.
- Ensuring that the resulting dataset is comprehensive will allow us to compare the ratings of more brands across organizations
- We use the Python library BeautifulSoup for web scraping because it is popular and well documented
- Sustainable Review documents all of the research which ultimately determines the overall sustainability score in their brand description
#### Definitions, Data, and Methods -
- Sustainable Review publishes sustainability content weekly
- Data will be collected from the Sustainable Review website
- The Python library requests is used for HTTP requests because it allows us to define the number of retry attempts
- The Python library BeautifulSoup is used to parse the webpage for each brand review and extract the overall rating and brand description
#### Biases & Assumptions -
- All brand review pages include the brand name, rating, and description
- All brand review pages are structured the same
- Breaking down the web scraping process will speed it up
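
The first two assumptions above can be checked while scraping instead of being taken on faith. The following is a minimal sketch, not the notebook's actual scraper: `parse_review` is a hypothetical helper, and the sample markup is made up for illustration (the real page structure may differ). It returns `None` when an expected element is missing rather than raising an error.

```python
from bs4 import BeautifulSoup


def parse_review(html):
    """Return (brand, rating), or None if an expected element is missing."""
    content = BeautifulSoup(html, "html.parser")
    title = content.find("h1", class_="post-title")
    info_box = content.find("div", class_="InfoBox")
    # find() returns None when no matching tag exists
    if title is None or info_box is None:
        return None
    paragraph = info_box.find("p")
    if paragraph is None:
        return None
    tokens = paragraph.get_text().split(" ")
    # assumption: the rating is the fourth space-separated token
    if len(tokens) < 4:
        return None
    return title.get_text(), tokens[3]


# hypothetical markup for illustration only
complete_page = (
    '<h1 class="post-title">Example Brand</h1>'
    '<div class="InfoBox"><p>Our overall rating: 4 out of 5</p></div>'
)
incomplete_page = '<h1 class="post-title">Test Page</h1>'

print(parse_review(complete_page))
print(parse_review(incomplete_page))
```

Pages that return `None` can be collected in a list and inspected afterwards, which makes it easy to see how often the "same structure" assumption fails.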

### Quantitative:

In [4]:
# imports
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from bs4 import BeautifulSoup
import pandas as pd
import re
import csv

In [15]:
# create session
session = requests.Session()
# retry failed connections up to five times, with increasing delays between attempts
retry = Retry(connect=5, backoff_factor=0.5)
# attach the retry policy to all https requests made through this session
adapter = HTTPAdapter(max_retries=retry)
session.mount('https://', adapter)
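
The session above limits retries for connection errors. A possible extension, sketched here under assumptions, is to also retry when the server answers with selected HTTP status codes via `status_forcelist`. The names `retry_policy` and `robust_session` and the choice of status codes are illustrative, not part of the original notebook.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# assumption: these status codes are treated as transient and worth retrying
retry_policy = Retry(
    total=5,
    connect=5,
    backoff_factor=0.5,
    status_forcelist=[429, 500, 502, 503, 504],
)

robust_session = requests.Session()
adapter = HTTPAdapter(max_retries=retry_policy)
# mount the same adapter for both schemes so every request uses the policy
robust_session.mount("https://", adapter)
robust_session.mount("http://", adapter)
```

No request is sent by this block; it only configures the session, so it can be dropped in place of the existing session setup if status-based retries turn out to be useful.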

In [16]:
# get page range
first_page_url = "https://sustainablereview.com/brand-ratings/"
first_page = session.get(first_page_url)
content = BeautifulSoup(first_page.text, 'html.parser')
# last page number out of all page numbers
last_page_num = str(content.find_all('a', class_="page-numbers")[-1]).split('>')[1].split('<')[0]

print(last_page_num)

39


In [17]:
# get brand links from all pages
brand_urls = []
# loop through all pages
for i in range(1, int(last_page_num) + 1):
    # scrape first page
    if i == 1:
        page = first_page
    # scrape remaining pages
    else:
        next_page_url = "https://sustainablereview.com/brand-ratings/?query-48-page="
        page = session.get(next_page_url + str(i))

    # content
    content = BeautifulSoup(page.text, 'html.parser')

    # links to brand reviews (anchors without an href are skipped)
    base_url = "https://sustainablereview.com/brand-ratings/"
    hrefs = (link.get('href') for link in content.find_all("a"))
    brands = set(href for href in hrefs if href and base_url in href and href != base_url)

    # update brand urls list
    brand_urls.extend(brands)

print(brand_urls)

['https://sustainablereview.com/brand-ratings/aarven/', 'https://sustainablereview.com/brand-ratings/a-a-k-s/', 'https://sustainablereview.com/brand-ratings/aestethic-london/', 'https://sustainablereview.com/brand-ratings/a-dam/', 'https://sustainablereview.com/brand-ratings/adele-dejak/', 'https://sustainablereview.com/brand-ratings/absolutely-bear/', 'https://sustainablereview.com/brand-ratings/adarche-clothing/', 'https://sustainablereview.com/brand-ratings/a-bch/', 'https://sustainablereview.com/brand-ratings/afroblonde/', 'https://sustainablereview.com/brand-ratings/division/', 'https://sustainablereview.com/brand-ratings/acbc/', 'https://sustainablereview.com/brand-ratings/a-roege-hove/', 'https://sustainablereview.com/brand-ratings/a_c/', 'https://sustainablereview.com/brand-ratings/aeance/', 'https://sustainablereview.com/brand-ratings/aera/', 'https://sustainablereview.com/brand-ratings/adidas/', 'https://sustainablereview.com/brand-ratings/about-companions/', 'https://sustain

In [18]:
# create dataframe
sustainable_review_df = pd.DataFrame(columns=['Brand', 'Rating', 'Description', 'URL'])

In [19]:
# define function to scrape for brand data
def scrape_sustainable_review(urls, df):
    for url in urls:
        # reset each iteration so an error is never reported under the previous brand's name
        brand = None
        try:
            # search for brand review
            review = session.get(url)

            # content
            content = BeautifulSoup(review.text, 'html.parser')

            # brand
            brand = content.find('h1', class_='post-title').get_text()

            # rating (fourth space-separated token of the first paragraph in the info box)
            information = content.find('div', class_='InfoBox')
            rating = information.find('p').get_text().split(" ")[3]

            # description
            body = content.find('div', class_='col-md-12 col-lg-9')
            # find similar brands section of description
            similar_brands = body.find_all('div', class_='col-sm-6 col-md-4')
            # clear similar brands section
            for rec in similar_brands:
                for child in rec.find_all():
                    child.extract()
            # clean description
            description = str(body.find_all('p')).replace('<p>', '').replace('</p>,', '').replace('</p>', '').replace("[","").replace("]","")

            # update dataframe
            df.loc[len(df.index)] = [brand, rating, description, url]

        except Exception as e:
            # fall back to the URL when the brand name was never parsed
            print(str(brand or url) + ": " + str(e))

In [20]:
# scrape for sustainable review data
scrape_sustainable_review(brand_urls, sustainable_review_df)

Test Page: 'NoneType' object has no attribute 'find'


In [22]:
# visualize first 10 rows of the new dataset
sustainable_review_df.head(10)

Unnamed: 0,Brand,Rating,Description,URL
0,AARVEN,4,The AARVEN brand is a renowned apparel and jew...,https://sustainablereview.com/brand-ratings/aa...
1,A A K S,4,A A K S is a unique and sustainable brand that...,https://sustainablereview.com/brand-ratings/a-...
2,Aestethic London,5,Aestethic London is an environmentally conscio...,https://sustainablereview.com/brand-ratings/ae...
3,A-dam,4,A-dam is a sustainable fashion brand that is m...,https://sustainablereview.com/brand-ratings/a-...
4,Adele Dejak,3,Adele Dejak is a brand that is making efforts ...,https://sustainablereview.com/brand-ratings/ad...
5,Absolutely Bear,4,Absolutely Bear is an eco-conscious brand that...,https://sustainablereview.com/brand-ratings/ab...
6,Adarche Clothing,4,"When it comes to sustainability, Adarche Cloth...",https://sustainablereview.com/brand-ratings/ad...
7,A.BCH,5,"\nA.BCH, a leading sustainable fashion brand, ...",https://sustainablereview.com/brand-ratings/a-...
8,AFROBLONDE,4,Our “Planet” rating evaluates brands based on ...,https://sustainablereview.com/brand-ratings/af...
9,(di)vision,4,(di)vision is a brand that takes sustainabilit...,https://sustainablereview.com/brand-ratings/di...


In [23]:
len(sustainable_review_df)

1140

In [24]:
# export csv file
sustainable_review_df.to_csv('../data/sustainable_review_ratings.csv')

### Qualitative (pt. 2):
#### Answer/Update to Question/Claim
- We successfully broke down the web scraping process by writing out a function which specifically scrapes brand review pages by Sustainable Review and takes in a list of links.
- As no active links were missed, increasing the number of attempts may have improved the chances that all links are accessed, but writing out a function made a greater difference.

#### Summary & Re-contextualization
- The average runtime for scraping the Sustainable Review website dropped from ~30 minutes to ~6 minutes, which is a major improvement.
- Collecting the full brand description rather than just the headings in the brand description yielded a lot more information about how brands were rated
- The Sustainable Review dataset is now of the same quality as the Eco-Stylist and Good on You datasets

#### Uncertainty, Limitations & Caveats
- Some links may still fail to be accessed even with more tries
- The brand description is much longer than just the headers, meaning the text includes more irrelevant words that will not help predict the brand rating

#### New Problems & Next Steps
- Since Megan worked on extracting relevant words from brand descriptions, adding the extracted keywords into this dataset and using them to calculate our own sustainability score would be the next step
- The quality of the dataset has also been improved, so it can be combined with the Eco-Stylist dataset for easier comparisons across ratings

#### Story & Domain Knowledge
- Scraping the brand ratings from Sustainable Review again allowed us to improve upon prior code, making it more efficient, less redundant, and more reliable
- We relied on domain knowledge about Python to determine that scraping the entire brand description would be most helpful for calculating a sustainability score

# Part 2: Combine Rating Datasets

1. Encode qualitative ratings with quantitative values
2. Normalize brand names
3. Join dataframes on brand name
4. Create 2 dataframes (sustainable brands & all brands)

## **Q3** How can we compare ratings from organizations like Good on You, Eco-Stylist, and Sustainable Review?
### Qualitative:
#### Problem -
- To compare brand ratings across organizations, we must read in 4 separate datasets and repeatedly combine them
- Each organization rated their brands on different scales and criteria, meaning some brands are ranked by category rather than a rating out of 5
- How can we combine all four datasets in an understandable way and compare different types of brand ratings?
#### Hypothesis & Claim -
- We will join all four datasets on the name of the brand and standardize the overall brand rating from each organization to calculate the average rating for a brand
- Converting qualitative rankings to quantitative ratings will make brand ratings easier to compare because quantitative ratings are easier to scale
#### Context, Motivation & Rationale -
- We are motivated to combine the datasets in a clear and understandable way so that we only have to read in 1 dataset in the future when conducting data analysis and calculating sustainability scores
- Combining the datasets beforehand also removes the possibility that they will be combined incorrectly later
- Synthesizing the overall brand ratings through an average makes sense since they all have well documented methodologies for determining brand ratings
- Data manipulation is done with the Python library Pandas because it is popular and well documented
#### Definitions, Data, and Methods -
- Sustainable Review publishes sustainability content weekly
- Eco-Stylist is a guide to ethical and eco-friendly fashion
- Good on You is home to thousands of brand ratings, articles, and sustainable fashion expertise
- Data will be read in from the existing data folder in our GitHub repository 
- Datasets will be joined and categorical ratings will be converted using the industry standard Python library Pandas for data analysis and manipulation  
#### Biases & Assumptions -
- All brand rankings hold the same weight (given that they are being averaged)
- All brand names are unique and can be used as the key to join on
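
The equal-weight assumption corresponds to a plain row-wise mean over whichever ratings exist for a brand. A minimal sketch with made-up rows follows; the rating column names mirror those used later in this notebook, while `avg_rating` and `n_ratings` are hypothetical column names introduced only for illustration.

```python
import pandas as pd

# made-up rows; a missing rating means the organization did not rate the brand
ratings = pd.DataFrame({
    "brand": ["Brand A", "Brand B", "Brand C"],
    "es_rating": [5.0, 1.7, None],
    "sr_rating": [4.0, 4.0, 3.0],
    "goy_rating": [4.0, None, None],
    "goyr_rating": [4.0, None, None],
})

rating_columns = ["es_rating", "sr_rating", "goy_rating", "goyr_rating"]

# mean(axis=1) skips missing values, so each brand is averaged over the ratings it has
ratings["avg_rating"] = ratings[rating_columns].mean(axis=1)
# number of organizations that contributed to each average
ratings["n_ratings"] = ratings[rating_columns].notna().sum(axis=1)

print(ratings[["brand", "avg_rating", "n_ratings"]])
```

Keeping a count next to the average makes it visible when an "average" is based on a single organization.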

### Quantitative:

In [102]:
# read in datasets
eco_stylist = pd.read_csv("../data/eco_stylist_ratings.csv")
sustainable_review = pd.read_csv("../data/sustainable_review_ratings.csv")
good_on_you_recommended = pd.read_csv("../data/gou_recommended.csv")
good_on_you = pd.read_csv("../data/brand_info.csv")

eco_stylist.head(5)

Unnamed: 0.1,Unnamed: 0,Brand,Overall,Transparency,Fair Labor,Sustainably Made,URL
0,0,KNOWN SUPPLY,Certified,Good,Excellent,Good,https://www.eco-stylist.com/ethical-brand/know...
1,1,AMENDI,Certified,Excellent,Fair,Good,https://www.eco-stylist.com/ethical-brand/amendi/
2,2,SANVT,Certified,Excellent,Good,Good,https://www.eco-stylist.com/ethical-brand/sanvt/
3,3,Patagonia,Gold,Excellent,Excellent,Excellent,https://www.eco-stylist.com/ethical-brand/pata...
4,4,Scotch & Soda,Silver,Excellent,Fair,Excellent,https://www.eco-stylist.com/ethical-brand/scot...


In [103]:
# correct index column for sustainable_review and eco_stylist
eco_stylist.drop(columns='Unnamed: 0', inplace=True)
sustainable_review.drop(columns='Unnamed: 0', inplace=True)

eco_stylist.head(5)

Unnamed: 0,Brand,Overall,Transparency,Fair Labor,Sustainably Made,URL
0,KNOWN SUPPLY,Certified,Good,Excellent,Good,https://www.eco-stylist.com/ethical-brand/know...
1,AMENDI,Certified,Excellent,Fair,Good,https://www.eco-stylist.com/ethical-brand/amendi/
2,SANVT,Certified,Excellent,Good,Good,https://www.eco-stylist.com/ethical-brand/sanvt/
3,Patagonia,Gold,Excellent,Excellent,Excellent,https://www.eco-stylist.com/ethical-brand/pata...
4,Scotch & Soda,Silver,Excellent,Fair,Excellent,https://www.eco-stylist.com/ethical-brand/scot...


In [104]:
# check column types for eco-stylist prior to encoding
eco_stylist.dtypes

Brand               object
Overall             object
Transparency        object
Fair Labor          object
Sustainably Made    object
URL                 object
dtype: object

In [105]:
# order categorical rankings in rating columns (categories are listed from lowest to highest)
eco_stylist['Overall'] = pd.Categorical(eco_stylist['Overall'], categories=['Certified', 'Silver', 'Gold'], ordered=True)
subratings = ['Transparency', 'Fair Labor', 'Sustainably Made']
# as listed, 'Good' encodes below 'Fair', which encodes below 'Excellent'
for subrating in subratings:
    eco_stylist[subrating] = pd.Categorical(eco_stylist[subrating], categories=['Good', 'Fair', 'Excellent'], ordered=True)

eco_stylist.dtypes

Brand                 object
Overall             category
Transparency        category
Fair Labor          category
Sustainably Made    category
URL                   object
dtype: object
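
The integer code assigned to each label depends entirely on the order of the `categories` list, so that order determines every downstream rating. The sketch below compares the order used in the cell above with an alternative order; the alternative (`Fair` below `Good`) is an assumption for illustration and is not confirmed by Eco-Stylist's methodology.

```python
import pandas as pd

labels = ["Good", "Fair", "Excellent"]

# order as written in the cell above: Good < Fair < Excellent
as_written = pd.Categorical(labels, categories=["Good", "Fair", "Excellent"], ordered=True)
# alternative order (an assumption): Fair < Good < Excellent
alternative = pd.Categorical(labels, categories=["Fair", "Good", "Excellent"], ordered=True)

# codes start at 0, so add 1 to get ratings that start at 1
codes_as_written = [int(code) + 1 for code in as_written.codes]
codes_alternative = [int(code) + 1 for code in alternative.codes]

print(dict(zip(labels, codes_as_written)))
print(dict(zip(labels, codes_alternative)))
```

Printing the label-to-code mapping like this is a quick way to confirm that the encoding matches the intended ranking before scaling.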

In [106]:
# encode categorical rankings from eco_stylist into quantitative ratings
eco_stylist_encoded = eco_stylist.copy()
categorical_columns = eco_stylist_encoded.select_dtypes(['category']).columns
eco_stylist_encoded[categorical_columns] = eco_stylist_encoded[categorical_columns].apply(lambda x: x.cat.codes + 1)

eco_stylist_encoded.head(5)

Unnamed: 0,Brand,Overall,Transparency,Fair Labor,Sustainably Made,URL
0,KNOWN SUPPLY,1,1,3,1,https://www.eco-stylist.com/ethical-brand/know...
1,AMENDI,1,3,2,1,https://www.eco-stylist.com/ethical-brand/amendi/
2,SANVT,1,3,1,1,https://www.eco-stylist.com/ethical-brand/sanvt/
3,Patagonia,3,3,3,3,https://www.eco-stylist.com/ethical-brand/pata...
4,Scotch & Soda,2,3,2,3,https://www.eco-stylist.com/ethical-brand/scot...


In [107]:
# scale quantitative ratings up from a maximum of 3 to a maximum of 5 (multiply by 5/3)
eco_stylist_scaled = eco_stylist_encoded.copy()
rating_columns = eco_stylist_scaled.select_dtypes(['int8']).columns
eco_stylist_scaled[rating_columns] = eco_stylist_scaled[rating_columns].apply(lambda x: round(((5/3) * x), 1))

eco_stylist_scaled.head(5)

Unnamed: 0,Brand,Overall,Transparency,Fair Labor,Sustainably Made,URL
0,KNOWN SUPPLY,1.7,1.7,5.0,1.7,https://www.eco-stylist.com/ethical-brand/know...
1,AMENDI,1.7,5.0,3.3,1.7,https://www.eco-stylist.com/ethical-brand/amendi/
2,SANVT,1.7,5.0,1.7,1.7,https://www.eco-stylist.com/ethical-brand/sanvt/
3,Patagonia,5.0,5.0,5.0,5.0,https://www.eco-stylist.com/ethical-brand/pata...
4,Scotch & Soda,3.3,5.0,3.3,5.0,https://www.eco-stylist.com/ethical-brand/scot...
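
Multiplying by 5/3 maps the codes 1, 2, 3 to 1.7, 3.3, 5.0, so the lowest possible value is 1.7 rather than 1. For comparison, a min-max mapping would send 1, 2, 3 to 1, 3, 5. The sketch below works through both; `scale_proportional` and `scale_min_max` are hypothetical helper names, and the min-max variant is shown as an alternative, not as the method used in this notebook.

```python
def scale_proportional(x, old_max=3, new_max=5):
    """Scale by the ratio of the maxima (the approach used above)."""
    return round((new_max / old_max) * x, 1)


def scale_min_max(x, old_min=1, old_max=3, new_min=1, new_max=5):
    """Map [old_min, old_max] linearly onto [new_min, new_max]."""
    return new_min + (x - old_min) * (new_max - new_min) / (old_max - old_min)


for code in (1, 2, 3):
    print(code, scale_proportional(code), scale_min_max(code))
```

Which mapping is preferable depends on whether the lowest Eco-Stylist tier should be treated as equivalent to the lowest rating on the other organizations' 5-point scales.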


In [108]:
# normalize brand names function
def normalize_brand_names(df, brand_col):
    try:
        norm_brands = []
        for brand in df[brand_col]: 
            # keep only alphanumeric characters, lowercased (drops spaces and special characters)
            norm_brands.append(''.join(e.lower() for e in brand if e.isalnum()))
        
        # create new column for normalized brand names
        df['norm_brands'] = norm_brands

    except Exception as e:
        print(e)
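
The normalization rule itself can be tried on single names. The sketch below uses a hypothetical helper, `normalize_name`, that applies the same rule to one string; the example names are taken from the datasets in this notebook.

```python
def normalize_name(brand):
    """Lowercase a brand name and keep only its alphanumeric characters."""
    return ''.join(ch.lower() for ch in brand if ch.isalnum())


examples = ["Scotch & Soda", "A.BCH", "(di)vision", "MORI", "mori"]
normalized = {name: normalize_name(name) for name in examples}

for name, key in normalized.items():
    print(name, "->", key)
```

Names that differ only in capitalization or punctuation, such as `MORI` and `mori`, collapse to the same key. A per-string helper like this could also be applied to a column with `df[brand_col].map(normalize_name)`.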

In [109]:
# normalize eco-stylist brand names
normalize_brand_names(eco_stylist_scaled, 'Brand')

# check that all brand names remain unique
print("# of brand names: " + str(len(eco_stylist_scaled['Brand'].unique())))
print("# of normalized brand names: " + str(len(eco_stylist_scaled['norm_brands'].unique())))

eco_stylist_scaled.head(5)

# of brand names: 102
# of normalized brand names: 102


Unnamed: 0,Brand,Overall,Transparency,Fair Labor,Sustainably Made,URL,norm_brands
0,KNOWN SUPPLY,1.7,1.7,5.0,1.7,https://www.eco-stylist.com/ethical-brand/know...,knownsupply
1,AMENDI,1.7,5.0,3.3,1.7,https://www.eco-stylist.com/ethical-brand/amendi/,amendi
2,SANVT,1.7,5.0,1.7,1.7,https://www.eco-stylist.com/ethical-brand/sanvt/,sanvt
3,Patagonia,5.0,5.0,5.0,5.0,https://www.eco-stylist.com/ethical-brand/pata...,patagonia
4,Scotch & Soda,3.3,5.0,3.3,5.0,https://www.eco-stylist.com/ethical-brand/scot...,scotchsoda


In [110]:
# normalize sustainable review brand names and check that all brand names remain unique
normalize_brand_names(sustainable_review, 'Brand')

# check that all brand names remain unique
print("# of brand names: " + str(len(sustainable_review['Brand'].unique())))
print("# of normalized brand names: " + str(len(sustainable_review['norm_brands'].unique())))

sustainable_review.head(5)

# of brand names: 1140
# of normalized brand names: 1140


Unnamed: 0,Brand,Rating,Description,URL,norm_brands
0,AARVEN,4,The AARVEN brand is a renowned apparel and jew...,https://sustainablereview.com/brand-ratings/aa...,aarven
1,A A K S,4,A A K S is a unique and sustainable brand that...,https://sustainablereview.com/brand-ratings/a-...,aaks
2,Aestethic London,5,Aestethic London is an environmentally conscio...,https://sustainablereview.com/brand-ratings/ae...,aestethiclondon
3,A-dam,4,A-dam is a sustainable fashion brand that is m...,https://sustainablereview.com/brand-ratings/a-...,adam
4,Adele Dejak,3,Adele Dejak is a brand that is making efforts ...,https://sustainablereview.com/brand-ratings/ad...,adeledejak


In [111]:
# normalize good on you brand names and check that all brand names remain unique
normalize_brand_names(good_on_you, 'brand')

# check that all brand names remain unique
print("# of brand names: " + str(len(good_on_you['brand'].unique())))
print("# of normalized brand names: " + str(len(good_on_you['norm_brands'].unique())))

good_on_you.head(5)

# of brand names: 105
# of normalized brand names: 105


Unnamed: 0,brand,overall_rating,planet_score,people_score,animals_score,description,norm_brands
0,Princess Polly,2,2.0,2.0,4.0,Our “Planet” rating evaluates brands based on ...,princesspolly
1,Brandy Melville,1,1.0,1.0,0.0,This brand provides insufficient relevant info...,brandymelville
2,Shein,1,1.0,1.0,2.0,Our “Planet” rating evaluates brands based on ...,shein
3,Nike,3,3.0,3.0,2.0,Our “Planet” rating evaluates brands based on ...,nike
4,Abercrombie & Fitch,2,2.0,2.0,2.0,Abercrombie & Fitch is owned by Abercrombie Ab...,abercrombiefitch


In [112]:
# normalize recommended good on you brand names and check that all brand names remain unique
normalize_brand_names(good_on_you_recommended, 'brand')

# check that all brand names remain unique
print("# of brand names: " + str(len(good_on_you_recommended['brand'].unique())))
print("# of normalized brand names: " + str(len(good_on_you_recommended['norm_brands'].unique())))

good_on_you_recommended.head(5)

# of brand names: 454
# of normalized brand names: 453


Unnamed: 0,brand,overall_rating,planet_score,people_score,animals_score,description,norm_brands
0,Enfant Terrible,4,5,4,4.0,Enfant Terrible's environment rating is 'great...,enfantterrible
1,milo+nicki,4,5,3,4.0,milo+nicki's environment rating is 'great'. It...,milonicki
2,Birdsong,4,5,5,3.0,Birdsong's environment rating is 'great'. It u...,birdsong
3,DAYWEARLAB,4,5,3,3.0,DAYWEARLAB's environment rating is 'great'. It...,daywearlab
4,LOVETRUST,4,4,3,5.0,LOVETRUST's environment rating is 'good'. It u...,lovetrust


In [113]:
# identify duplicate normalized brand name 
norm_brands = []
for brand in good_on_you_recommended['norm_brands']:
    if brand not in norm_brands:
        # add brand to list if it is not a duplicate
        norm_brands.append(brand)
    else:
        print("Duplicate brand name: " + brand)

Duplicate brand name: mori
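
The same check can be expressed with pandas directly. This is a minimal sketch on a made-up three-row frame, using `Series.duplicated(keep=False)` to flag every row that shares a normalized name with another row.

```python
import pandas as pd

# made-up sample with one colliding normalized name
sample = pd.DataFrame({
    "brand": ["MORI", "mori", "Birdsong"],
    "norm_brands": ["mori", "mori", "birdsong"],
})

# keep=False marks all members of a duplicate group, not just the later ones
duplicate_rows = sample[sample["norm_brands"].duplicated(keep=False)]
print(duplicate_rows)
```

This returns the duplicated rows themselves, so the separate lookup by name is not needed.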


In [114]:
# extract rows with duplicate normalized brand name
good_on_you_recommended[good_on_you_recommended['norm_brands'] == 'mori']

Unnamed: 0,brand,overall_rating,planet_score,people_score,animals_score,description,norm_brands
73,MORI,3,3,2,4.0,MORI's environment rating is 'it's a start'. I...,mori
124,mori,4,5,4,4.0,Our “Planet” rating evaluates brands based on ...,mori


In [115]:
# clean and rename columns of all datasets
eco_stylist_scaled.drop(columns='URL', inplace=True)
eco_stylist_scaled.rename(columns={'Brand': 'es_brand', 'Overall':'es_rating', 'Transparency':'es_transparency', 'Fair Labor':'es_fair_labor', 'Sustainably Made':'es_sustainably_made'}, inplace=True)

sustainable_review.drop(columns='URL', inplace=True)
sustainable_review.rename(columns={'Brand': 'sr_brand', 'Rating':'sr_rating', 'Description':'sr_description'}, inplace=True)

good_on_you.rename(columns={'brand': 'goy_brand', 'overall_rating':'goy_rating', 'planet_score':'goy_planet', 'people_score':'goy_people', 'animals_score':'goy_animals', 'description':'goy_description'}, inplace=True)

good_on_you_recommended.rename(columns={'brand': 'goyr_brand', 'overall_rating':'goyr_rating', 'planet_score':'goyr_planet', 'people_score':'goyr_people', 'animals_score':'goyr_animals', 'description':'goyr_description'}, inplace=True)

In [116]:
# check for duplicate normalized brand name in other datasets
for df in [eco_stylist_scaled, sustainable_review, good_on_you]:
    for brand in df['norm_brands']:
        if 'mori' == brand:
            duplicate = df[df['norm_brands'] == 'mori']

            print(duplicate)
            print(duplicate.index)

    sr_brand  sr_rating                                     sr_description  \
652     mori          4  Our “Planet” rating evaluates brands based on ...   

    norm_brands  
652        mori  
Index([652], dtype='int64')


In [137]:
# manually alter duplicate normalized brand name to join datasets
# (a single .loc call with a row mask and column label avoids chained assignment)
good_on_you_recommended.loc[good_on_you_recommended['goyr_brand'] == 'MORI', 'norm_brands'] = 'mori2'
good_on_you_recommended[good_on_you_recommended['goyr_brand'] == 'MORI']


Unnamed: 0,goyr_brand,goyr_rating,goyr_planet,goyr_people,goyr_animals,goyr_description,norm_brands
73,MORI,3,3,2,4.0,MORI's environment rating is 'it's a start'. I...,mori2


In [174]:
# join datasets on normalized brand names
two_brands = pd.merge(eco_stylist_scaled, sustainable_review, how='outer', on='norm_brands')
three_brands = pd.merge(two_brands, good_on_you, how='outer', on='norm_brands')
all_brand_ratings = pd.merge(three_brands, good_on_you_recommended, how='outer', on='norm_brands')

all_brand_ratings.head(5)

Unnamed: 0,es_brand,es_rating,es_transparency,es_fair_labor,es_sustainably_made,norm_brands,sr_brand,sr_rating,sr_description,goy_brand,...,goy_planet,goy_people,goy_animals,goy_description,goyr_brand,goyr_rating,goyr_planet,goyr_people,goyr_animals,goyr_description
0,KNOWN SUPPLY,1.7,1.7,5.0,1.7,knownsupply,Known Supply,4.0,"<a data-wpel-link=""external"" href=""https://kno...",,...,,,,,,,,,,
1,AMENDI,1.7,5.0,3.3,1.7,amendi,AMENDI,4.0,AMENDI is a brand that is making commendable e...,,...,,,,,AMENDI,4.0,4.0,3.0,4.0,AMENDI's environment rating is 'good'. It uses...
2,SANVT,1.7,5.0,1.7,1.7,sanvt,,,,,...,,,,,,,,,,
3,Patagonia,5.0,5.0,5.0,5.0,patagonia,Patagonia,4.0,Patagonia is a brand that focuses on sustainab...,Patagonia,...,4.0,2.0,4.0,Our “Planet” rating evaluates brands based on ...,Patagonia,4.0,4.0,2.0,4.0,Our “Planet” rating evaluates brands based on ...
4,Scotch & Soda,3.3,5.0,3.3,5.0,scotchsoda,,,,,...,,,,,,,,,,


In [175]:
# compare lengths of individual datasets and joined datasets
print("Total length of individual datasets: " + str(len(eco_stylist_scaled) + len(sustainable_review) + len(good_on_you) + len(good_on_you_recommended)))
print("Total length of joined datasets: " + str(len(all_brand_ratings)))

Total length of individual datasets: 1801
Total length of joined datasets: 1328
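
The difference between the two totals comes from brands that appear in more than one dataset. `pd.merge` can report this overlap directly with `indicator=True`, and `validate="one_to_one"` makes the join fail loudly if a key is duplicated. A minimal sketch on made-up frames follows; the rating values are illustrative only.

```python
import pandas as pd

left = pd.DataFrame({"norm_brands": ["amendi", "patagonia"], "es_rating": [1.7, 5.0]})
right = pd.DataFrame({"norm_brands": ["patagonia", "aarven"], "sr_rating": [4, 4]})

# indicator=True adds a _merge column: left_only, right_only, or both
merged = pd.merge(left, right, how="outer", on="norm_brands",
                  validate="one_to_one", indicator=True)
overlap_counts = merged["_merge"].value_counts()
print(overlap_counts)

# a repeated key on one side violates the one-to-one expectation
right_with_duplicate = pd.concat([right, right.iloc[[0]]], ignore_index=True)
try:
    pd.merge(left, right_with_duplicate, how="outer", on="norm_brands",
             validate="one_to_one")
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True
print(duplicate_detected)
```

With validation in place, a colliding normalized name would surface as an error at join time instead of silently producing extra rows.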


In [176]:
# combine brand name columns from each dataset, keeping the first non-null name
all_brand_ratings['brand'] = (
    all_brand_ratings['es_brand']
    .fillna(all_brand_ratings['sr_brand'])
    .fillna(all_brand_ratings['goy_brand'])
    .fillna(all_brand_ratings['goyr_brand'])
)

print("Contains empty brand names: " + str(all_brand_ratings['brand'].isnull().values.any()))

Contains empty brand names: False


In [177]:
# drop redundant brand name columns
all_brand_ratings.drop(columns=['es_brand', 'sr_brand', 'goy_brand', 'goyr_brand', 'norm_brands'], inplace=True)
all_brand_ratings.head(5)

Unnamed: 0,es_rating,es_transparency,es_fair_labor,es_sustainably_made,sr_rating,sr_description,goy_rating,goy_planet,goy_people,goy_animals,goy_description,goyr_rating,goyr_planet,goyr_people,goyr_animals,goyr_description,brand
0,1.7,1.7,5.0,1.7,4.0,"<a data-wpel-link=""external"" href=""https://kno...",,,,,,,,,,,KNOWN SUPPLY
1,1.7,5.0,3.3,1.7,4.0,AMENDI is a brand that is making commendable e...,,,,,,4.0,4.0,3.0,4.0,AMENDI's environment rating is 'good'. It uses...,AMENDI
2,1.7,5.0,1.7,1.7,,,,,,,,,,,,,SANVT
3,5.0,5.0,5.0,5.0,4.0,Patagonia is a brand that focuses on sustainab...,4.0,4.0,2.0,4.0,Our “Planet” rating evaluates brands based on ...,4.0,4.0,2.0,4.0,Our “Planet” rating evaluates brands based on ...,Patagonia
4,3.3,5.0,3.3,5.0,,,,,,,,,,,,,Scotch & Soda


In [178]:
# reorder columns
columns = list(all_brand_ratings.columns)
columns.insert(0, columns.pop(columns.index('brand')))
all_brand_ratings = all_brand_ratings.loc[:, columns]

all_brand_ratings.head(5)

Unnamed: 0,brand,es_rating,es_transparency,es_fair_labor,es_sustainably_made,sr_rating,sr_description,goy_rating,goy_planet,goy_people,goy_animals,goy_description,goyr_rating,goyr_planet,goyr_people,goyr_animals,goyr_description
0,KNOWN SUPPLY,1.7,1.7,5.0,1.7,4.0,"<a data-wpel-link=""external"" href=""https://kno...",,,,,,,,,,
1,AMENDI,1.7,5.0,3.3,1.7,4.0,AMENDI is a brand that is making commendable e...,,,,,,4.0,4.0,3.0,4.0,AMENDI's environment rating is 'good'. It uses...
2,SANVT,1.7,5.0,1.7,1.7,,,,,,,,,,,,
3,Patagonia,5.0,5.0,5.0,5.0,4.0,Patagonia is a brand that focuses on sustainab...,4.0,4.0,2.0,4.0,Our “Planet” rating evaluates brands based on ...,4.0,4.0,2.0,4.0,Our “Planet” rating evaluates brands based on ...
4,Scotch & Soda,3.3,5.0,3.3,5.0,,,,,,,,,,,,


In [180]:
# export csv file
all_brand_ratings.to_csv('../data/all_brand_ratings.csv')

### Qualitative (pt. 2):

#### Answer/Update to Question/Claim
- We successfully joined all four datasets on the name of the brand by normalizing the brand names and standardized the overall brand rating from each organization to calculate the average rating for a brand
- Converting qualitative rankings to quantitative ratings did make brand ratings easier to compare because quantitative ratings are easier to scale and can be plugged into math formulas

#### Summary & Re-contextualization
- More than 450 brands were rated by multiple organizations, meaning their brand ratings could be averaged
- Joining the datasets and writing out a csv will be helpful for calculating our own sustainability score and conducting data analysis because correctly joining the datasets was a complex process which involved manually normalizing brand names, resulting in a high potential for error
- Brand names had to be normalized to use as keys for joining because brand names often contain special characters
- Rescaling the Eco-Stylist rankings from a 3-point to a 5-point scale (multiplying by 5/3) turned the ratings from integers into floats (1.7, 3.3, and 5.0), but this was not an issue because most of the averages are also floats

#### Uncertainty, Limitations & Caveats
- The Pandas function .to_csv() writes the index as an extra column by default, which is read back as an unnamed column (`Unnamed: 0`); this created some issues when initially reading in csv data from our GitHub repository
- Not all brand names were unique, as some were only differentiated by special characters or capitalization, so some normalized brand names had to be manually edited
- The code for combining columns and checking for duplicate normalized brand names is somewhat redundant
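
The unnamed-column issue can be reproduced and avoided in a few lines. The sketch below round-trips a made-up frame through an in-memory buffer instead of a file; passing `index=False` to `to_csv` is one way to keep the index out of the file.

```python
import io

import pandas as pd

df = pd.DataFrame({"brand": ["Brand A", "Brand B"], "rating": [4, 5]})

# default: the index is written as a first column with an empty header
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
columns_default = list(pd.read_csv(with_index).columns)

# index=False: only the data columns are written
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
columns_no_index = list(pd.read_csv(without_index).columns)

print(columns_default)
print(columns_no_index)
```

If existing files already contain the index column, reading them with `pd.read_csv(path, index_col=0)` is an alternative to dropping the column afterwards.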

#### New Problems & Next Steps
- Moving forward, we hope to reduce the redundancy in the code for joining datasets to improve efficiency
- We also hope to remove the manual element of identifying and editing duplicate normalized brand names so that the code is more reliable and long lasting
- Additional columns may also be added containing the keywords extracted from the brand descriptions for calculating our own sustainability score

#### Story & Domain Knowledge
- The assumption that brand names would be unique was proven false, so our approach to combining the datasets had to adjust accordingly
- This joined dataset offers a far more comprehensive idea of how fashion brands are rated for sustainability and has the potential to contribute to general domain knowledge about the topic