# Capstone Project - Book 1: Classifying real vs fake sneakers via images

## Problem Statement

The sneaker resale market was estimated at USD 2 billion in 2019 and is projected to reach USD 6 billion by 2025, according to research firm Cowen & Co. Because these commodities are so lucrative, a rise in counterfeits is inevitable. Counterfeit sneakers are a USD 450 million market, and we want to be able to tell real sneakers from fake ones.

Our task is to build a classifier that is able to differentiate between real and fake sneakers. 
Our primary audience is the sneaker brands. The negative impacts of counterfeit sneakers on brands include undercut sales, reputational damage and backlash from consumers.

To do so, we will first scrape data from Reddit and other sneaker resources, then use classification models such as CNN and xxxx to differentiate between authentic sneakers and replicas. We will measure our success using several classification metrics, including xxxx and yyyy.

With this, we also hope to help buyers inform themselves and stay away from counterfeits. Empowered with information, the public will be able to make the right decisions, which could help make fake sneakers less lucrative.


## Executive Summary

As the data science team at Nutrino, we have been tasked with building a classifier to improve the company's core product, which is to provide nutrition-related data services and analytics. We are also tasked with identifying patterns in 2 currently trending diets, keto and vegan.

Our classifier predicted with an accuracy score above 90%. We also identified patterns in the motivations and preferences of the 2 groups of subredditors, which will help determine the kind of customer engagement with each group.


## Notebooks:
- [Data Scraping and Cleaning](./book1_data_scrapping_cleaning.ipynb)
- [EDA](./book2_eda.ipynb)
- [Modeling and Recommendations](./book3_preprocesing_modeling_recommendations.ipynb)


## Contents:
- [Import Libraries](#Import-Libraries)
- [Data Scraping](#Data-Scraping)
- [Data Cleaning](#Data-Cleaning)
- [Save Data to CSV](#Save-Data-to-CSV)

## Import Libraries

In [50]:
import requests
import pandas as pd
import urllib.request

## Data Scraping

#### Get data from subreddits

Luckily for us, Imgur can display images grouped by subreddit. We will use its JSON API to retrieve links to the images we need.
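To make the response layout concrete, here is a minimal sketch of how direct image links can be assembled from one page of results. The `sample_page` payload is hypothetical and hand-written: it only mirrors the fields this notebook reads (`hash`, `ext`, `is_album`), and the real response may contain more keys or change shape over time. `direct_links` is an illustrative helper name, and it deliberately ignores albums to keep the example short.

```python
# Hypothetical payload mirroring the fields this notebook reads from the
# imgur JSON response; real responses contain many more keys.
sample_page = {
    "data": [
        {"hash": "abc1234", "ext": ".jpg", "is_album": False},
        {"hash": "def5678", "ext": ".gif", "is_album": False},
        {"hash": "ghi9012", "ext": ".png", "is_album": False},
    ]
}

IMAGE_TEMPLATE = "http://i.imgur.com/{}{}"   # same pattern as `imgur` in this notebook
SKIPPED_EXTS = {".gif", ".mp4"}              # animated or video posts are ignored


def direct_links(page):
    """Return direct image URLs for the non-album, still-image entries of one page."""
    links = []
    for entry in page["data"]:
        if entry["is_album"] or entry["ext"] in SKIPPED_EXTS:
            continue
        links.append(IMAGE_TEMPLATE.format(entry["hash"], entry["ext"]))
    return links


links = direct_links(sample_page)
print(links)
```

Keeping the parsing in a small function like this makes it possible to check the link-building logic offline before sending any requests.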

In [76]:
#list of subreddits to scrape
sub_reds = ["repsneakers","sneakermarket","sneakers"]

#create lists to store scraped data
image_url = []
rep_label = []

#save links to variables
imgur     = 'http://i.imgur.com/{}{}'
page_api  = 'http://imgur.com/r/{}/new/page/{}/hit.json'
album_api = 'http://imgur.com/ajaxalbums/getimages/{}/hit.json'


for sub_red in sub_reds:                   #iterate through the list of sub_reds
    s = requests.session()                 #instantiate session
    s.headers['user-agent'] = 'Mozilla/5.0'

    for i in range(2):
        url = page_api.format(sub_red,i)
        print(f"retrieving url from page {i+1} of {sub_red}...")

        j = s.get(url).json()
        for entry in j['data']:
            #skip animated and video posts
            if entry['ext'] in ('.gif', '.mp4'):
                continue

            if entry['is_album']:
                #albums hold several images, so fetch the album listing
                album = s.get(album_api.format(entry['hash'])).json()
                for image in album['data']['images']:
                    #check the extension of each album image, not of the parent entry
                    if image['ext'] in ('.gif', '.mp4'):
                        continue
                    image_url.append(imgur.format(image['hash'], image['ext']))
                    rep_label.append(sub_red)
            else:
                image_url.append(imgur.format(entry['hash'], entry['ext']))
                rep_label.append(sub_red)
            

#credit: https://kaijento.github.io/2017/05/08/web-scraping-imgur.com/

retrieving url from page 1 of repsneakers...
retrieving url from page 2 of repsneakers...
retrieving url from page 1 of sneakermarket...
retrieving url from page 2 of sneakermarket...
retrieving url from page 1 of sneakers...
retrieving url from page 2 of sneakers...


In [71]:
#check that the number of scraped urls matches the number of labels
print(len(image_url))
print(len(rep_label))

1681
1681
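
Matching lengths confirm the two lists are in sync, but they do not show whether the same image was collected more than once (for example, from overlapping pages). The following is a minimal sketch of a duplicate check, assuming that exact URL equality is a good enough test. The helper names and the demo data are hypothetical and not part of the original notebook.

```python
from collections import Counter


def find_duplicates(urls):
    """Return a dict of URLs that appear more than once, with their counts."""
    counts = Counter(urls)
    return {url: n for url, n in counts.items() if n > 1}


def drop_duplicates_keep_first(urls, labels):
    """Drop repeated URLs, keeping the first occurrence and its label."""
    if len(urls) != len(labels):
        raise ValueError("urls and labels must have the same length")
    seen = set()
    kept_urls, kept_labels = [], []
    for url, label in zip(urls, labels):
        if url in seen:
            continue
        seen.add(url)
        kept_urls.append(url)
        kept_labels.append(label)
    return kept_urls, kept_labels


# Hypothetical demo data standing in for image_url and rep_label
demo_urls = [
    "http://i.imgur.com/a.jpg",
    "http://i.imgur.com/b.jpg",
    "http://i.imgur.com/a.jpg",
]
demo_labels = ["repsneakers", "sneakers", "sneakers"]

dupes = find_duplicates(demo_urls)
clean_urls, clean_labels = drop_duplicates_keep_first(demo_urls, demo_labels)
print(dupes)
print(clean_urls, clean_labels)
```

Note that keeping the first occurrence also keeps its label, so an image posted in two subreddits would retain whichever label was scraped first.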


#### Download Images

In [61]:
import urllib.request

# Adding information about user agent
opener=urllib.request.build_opener()
opener.addheaders=[('User-Agent','Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1941.0 Safari/537.36')]
urllib.request.install_opener(opener)

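for url in image_url[:100]:
    # use the last path segment (e.g. boiDIEZ.jpg) as the local file name
    dl_name = url.split("/")[-1]

    # note: the ./datasets/images folder must already exist
    urllib.request.urlretrieve(url,f'./datasets/images/{dl_name}')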
    
    
#credit: https://towardsdatascience.com/how-to-download-an-image-using-python-38a75cfa21c

KeyboardInterrupt: 

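import shutil

# single-image test; extend to `for url in image_url` once this works
dl_url = image_url[0]
dl_name = dl_url.split("/")[-1]

r = requests.get(dl_url, stream=True)

if r.status_code == 200:
    # let urllib3 decode gzip/deflate content before copying the raw stream
    r.raw.decode_content = True

    with open(dl_name,'wb') as f:
        shutil.copyfileobj(r.raw,f)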
    
        
    print(f"Download Complete: {dl_name}")
    
else: 
    print(f"Download failed: {dl_name}")
    

#credit: https://towardsdatascience.com/how-to-download-an-image-using-python-38a75cfa21c
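
A long download loop can be made more tolerant of individual failures. The following is a minimal sketch rather than the notebook's own method: it assumes that skipping files already on disk and recording failed URLs is acceptable. `fetch_bytes`, `download_all` and `fake_fetch` are hypothetical helper names. The demonstration uses a stand-in fetcher so it runs without network access.

```python
import os
import tempfile
import urllib.error
import urllib.request


def fetch_bytes(url, timeout=10):
    """Download a URL and return its body as bytes (needs network access)."""
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


def download_all(urls, out_dir, fetch=fetch_bytes):
    """Save each URL into out_dir and return (saved_paths, failed_urls)."""
    os.makedirs(out_dir, exist_ok=True)
    saved, failed = [], []
    for url in urls:
        path = os.path.join(out_dir, url.split("/")[-1])
        if os.path.exists(path):      # already downloaded on an earlier run
            saved.append(path)
            continue
        try:
            data = fetch(url)
        except (urllib.error.URLError, OSError, ValueError):
            failed.append(url)        # record the failure and keep going
            continue
        with open(path, "wb") as f:
            f.write(data)
        saved.append(path)
    return saved, failed


# Offline demonstration with a stand-in fetcher (hypothetical URLs).
def fake_fetch(url):
    if url.endswith("bad.jpg"):
        raise OSError("simulated failure")
    return b"fake image bytes"


demo_dir = tempfile.mkdtemp()
saved, failed = download_all(
    ["http://i.imgur.com/ok.jpg", "http://i.imgur.com/bad.jpg"],
    demo_dir,
    fetch=fake_fetch,
)
print(len(saved), "saved;", len(failed), "failed")
```

With the real fetcher, a call such as `download_all(image_url, './datasets/images')` could be re-run after an interruption, since files that already exist are skipped.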

#### Convert to dataframe

In [77]:
df = pd.DataFrame(list(zip(image_url,rep_label)),columns=['url','label'])

In [78]:
df['label'].value_counts()

repsneakers      1220
sneakermarket     262
sneakers          199
Name: label, dtype: int64
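
The labels above are subreddit names, while the stated goal is a real vs fake classifier. The following is a minimal sketch of one possible mapping. It assumes that images from r/repsneakers are treated as replicas and that images from the other two subreddits are treated as authentic; the notebook does not state this mapping explicitly. `REPLICA_SUBREDDITS` and `to_binary_label` are hypothetical names.

```python
from collections import Counter

# Assumption (not stated in the notebook): r/repsneakers posts are replicas,
# while the other two subreddits are treated as authentic.
REPLICA_SUBREDDITS = {"repsneakers"}


def to_binary_label(subreddit):
    """Map a subreddit name to 'fake' or 'real' under the assumption above."""
    return "fake" if subreddit in REPLICA_SUBREDDITS else "real"


# Class counts copied from the value_counts() output in this notebook
subreddit_counts = {"repsneakers": 1220, "sneakermarket": 262, "sneakers": 199}

binary_counts = Counter()
for subreddit, n in subreddit_counts.items():
    binary_counts[to_binary_label(subreddit)] += n

total = sum(binary_counts.values())
fake_share = binary_counts["fake"] / total
print(dict(binary_counts), round(fake_share, 3))
```

Under this assumption roughly 73% of the scraped links fall in the fake class, so the classes are imbalanced. The same function could be applied to the dataframe with `df['label'].map(to_binary_label)`.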

In [74]:
df.tail()

| | url | label |
|---|---|---|
| 1676 | http://i.imgur.com/boiDIEZ.jpg | sneakers |
| 1677 | http://i.imgur.com/O3IkyMU.jpg | sneakers |
| 1678 | http://i.imgur.com/97vmniK.jpg | sneakers |
| 1679 | http://i.imgur.com/JhWHsAn.jpg | sneakers |
| 1680 | http://i.imgur.com/RszyWCt.jpg | sneakers |


## Save Data to CSV

In [23]:
df.to_csv('./datasets/url.csv')