# Capstone Project - Book 1: Classifying real vs fake sneakers via images

## Problem Statement

The sneaker resale market was estimated at USD 2 billion in 2019 and is projected to reach USD 6 billion by 2025, according to research firm Cowen & Co. Because these commodities are so lucrative, a rise in counterfeits is inevitable. The counterfeit sneaker industry is a USD 450 million market, and we want to be able to differentiate real sneakers from fake ones.

Our task is to build a classifier that can differentiate between real and fake sneakers.
Our primary audience is the sneaker brands. The negative impacts of counterfeit sneakers include undercutting the brands' sales, damaging their reputation, and forcing them to deal with backlash from consumers.

To do so, we will first scrape data from Reddit and other sneaker resources, then use classification models such as CNN and xxxx to differentiate between authentic sneakers and replicas. We will measure our success using several classification metrics, including xxxx and yyyy.

With this, we also hope to help buyers inform themselves and stay away from counterfeits. Empowered with this information, the public will be able to make the right decisions, which could make fake sneakers less lucrative.


## Executive Summary

As the data science team in Nutrino, we have been tasked to build a classifier to improve the core product of the company, which is to provide nutrition-related data services and analytics. We are also tasked to identify patterns in 2 currently trending diets, keto and vegan.

Our classifier was successful, predicting with an accuracy score above 90%. We also identified patterns in the motivations and preferences of the 2 groups of subredditors, which will help determine the kind of customer engagement with each group.


## Notebooks:
- [Data Scrapping and Cleaning](./book1_data_scrapping_cleaning.ipynb)
- [EDA](./book2_eda.ipynb)
- [Modeling and Recommendations](./book3_preprocesing_modeling_recommendations.ipynb)


## Contents:
- [Import Libraries](#Import-Libraries)
- [Data Scrapping](#Data-Scrapping)
- [Data Cleaning](#Data-Cleaning)
- [Save Data to CSV](#Save-Data-to-CSV)

## Import Libraries

In [1]:
import requests
import numpy as np
import pandas as pd
import urllib.request

## Data Scrapping

#### Get data from subreddits

Lucky for us, Imgur can display images grouped by subreddit. We will use its JSON API to retrieve links to the images we need.

In [114]:
#list of subreddits to scrape
sub_reds = ["repsneakers","sneakerreps", "wengkksneakers",
            "chanzhfsneakers","michaelsneakers","sneakermarket",
            "sneakers","sneakerhead","sneakerscanada"]

#create lists to store scraped data
image_url = []
rep_label = []

#save links to variables
imgur     = 'http://i.imgur.com/{}{}'
page_api  = 'http://imgur.com/r/{}/new/page/{}/hit.json'
album_api = 'http://imgur.com/ajaxalbums/getimages/{}/hit.json'


for sub_red in sub_reds:                                 #iterate through the list of sub_reds
    s = requests.session()                               #instantiate session
    s.headers['user-agent'] = 'Mozilla/5.0'

    for i in range(5):                                   #iterate through pages
        url = page_api.format(sub_red,i)
        print(f"retrieving url from page {i+1} of {sub_red}...")

        j = s.get(url).json()
        for entry in j['data']:                          #iterate through post in each page
            if entry['ext'] in ('.gif', '.mp4'):
                continue                                 #if its a gif or video, skip it
            if entry['is_album']:                        #check if its an album
                album_url = album_api.format(entry['hash'])
                album = s.get(album_url).json()          #separate name so the page data in j is not overwritten
                for image in album['data']['images']:    #iterate through album
                    if image['ext'] in ('.gif', '.mp4'): #check the image's own extension, not the album entry's
                        continue
                    url = imgur.format(image['hash'], image['ext'])
                    image_url.append(url)                #append link to list
                    rep_label.append(sub_red)            #add label
            else:                                        #single image post
                url = imgur.format(entry['hash'], entry['ext'])
                image_url.append(url)
                rep_label.append(sub_red)

#credit: https://kaijento.github.io/2017/05/08/web-scraping-imgur.com/

retrieving url from page 1 of sneakermarket...
retrieving url from page 2 of sneakermarket...
retrieving url from page 3 of sneakermarket...
retrieving url from page 4 of sneakermarket...
retrieving url from page 5 of sneakermarket...
retrieving url from page 6 of sneakermarket...
retrieving url from page 7 of sneakermarket...
retrieving url from page 8 of sneakermarket...
retrieving url from page 9 of sneakermarket...
retrieving url from page 10 of sneakermarket...
retrieving url from page 11 of sneakermarket...
retrieving url from page 12 of sneakermarket...
retrieving url from page 13 of sneakermarket...
retrieving url from page 14 of sneakermarket...
retrieving url from page 15 of sneakermarket...
retrieving url from page 16 of sneakermarket...
retrieving url from page 17 of sneakermarket...
retrieving url from page 18 of sneakermarket...
retrieving url from page 19 of sneakermarket...
retrieving url from page 20 of sneakermarket...
retrieving url from page 1 of sneakers...
...
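
The filtering logic in the loop above can be isolated into a small pure function, which makes it possible to check the behaviour without calling Imgur. This is only a sketch: `collect_image_urls`, `fetch_album`, and the sample entries are hypothetical, and the entry fields (`hash`, `ext`, `is_album`) are assumed to match what the scraper reads from the JSON response.

```python
SKIPPED_EXTS = {'.gif', '.mp4'}          # animated or video content we do not want
IMGUR = 'http://i.imgur.com/{}{}'        # same direct-link pattern as the scraper


def collect_image_urls(entries, fetch_album):
    """Return direct image URLs for a list of post entries.

    entries     -- list of dicts with 'hash', 'ext' and 'is_album' keys
    fetch_album -- callable taking an album hash and returning its image dicts
                   (hypothetical stand-in for the album request)
    """
    urls = []
    for entry in entries:
        if entry['ext'] in SKIPPED_EXTS:
            continue                                  # skip gif/video posts
        if entry['is_album']:
            for image in fetch_album(entry['hash']):
                if image['ext'] not in SKIPPED_EXTS:  # filter on each image's own extension
                    urls.append(IMGUR.format(image['hash'], image['ext']))
        else:
            urls.append(IMGUR.format(entry['hash'], entry['ext']))
    return urls


# made-up entries: one image, one gif, one album holding a png and an mp4
sample_entries = [
    {'hash': 'aaa111', 'ext': '.jpg', 'is_album': False},
    {'hash': 'bbb222', 'ext': '.gif', 'is_album': False},
    {'hash': 'ccc333', 'ext': '.jpg', 'is_album': True},
]
sample_albums = {
    'ccc333': [{'hash': 'ddd444', 'ext': '.png'}, {'hash': 'eee555', 'ext': '.mp4'}],
}

urls = collect_image_urls(sample_entries, lambda album_hash: sample_albums[album_hash])
print(urls)
```

Only the standalone `.jpg` and the album's `.png` survive the filter; the gif post and the mp4 inside the album are dropped.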

In [115]:
#here we check the number of urls we have scraped
print(len(image_url))
print(len(rep_label))

5564
5564


#### Create labels for the dataset

In [116]:
rep_subs = {"repsneakers", "sneakerreps", "wengkksneakers", "chanzhfsneakers", "michaelsneakers"}  #subreddits labelled as replicas
is_rep = ['rep' if label in rep_subs else 'auth' for label in rep_label]

In [117]:
len(is_rep)

5564
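
The same labelling can also be done once the data is in a DataFrame. A minimal sketch, assuming a `label` column that holds the subreddit name; `REP_SUBS` and the toy rows are illustrative only.

```python
import numpy as np
import pandas as pd

# subreddits treated as replica sources, matching the list used for is_rep
REP_SUBS = {"repsneakers", "sneakerreps", "wengkksneakers",
            "chanzhfsneakers", "michaelsneakers"}

# made-up rows standing in for the scraped labels
toy = pd.DataFrame({"label": ["repsneakers", "sneakers", "sneakermarket", "sneakerreps"]})

# isin() gives a boolean mask; np.where maps it to the two class names
toy["is_rep"] = np.where(toy["label"].isin(REP_SUBS), "rep", "auth")
print(toy)
```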

#### Convert to dataframe

In [118]:
df = pd.DataFrame(list(zip(image_url,rep_label,is_rep)),columns=['url','label','is_rep'])
df.shape

(5564, 3)

In [140]:
df.head()

| | url | label | is_rep |
|---|---|---|---|
| 0 | http://i.imgur.com/v92pz5I.jpg | sneakermarket | auth |
| 1 | http://i.imgur.com/VwpI9qg.jpg | sneakermarket | auth |
| 2 | http://i.imgur.com/OgT4CwF.jpg | sneakermarket | auth |
| 3 | http://i.imgur.com/RfeaaRK.jpg | sneakermarket | auth |
| 4 | http://i.imgur.com/veiqN5R.jpg | sneakermarket | auth |


In [119]:
df['label'].value_counts()

sneakermarket     2500
sneakers          1992
sneakerscanada     704
sneakerhead        368
Name: label, dtype: int64

#### Observation 1 of the dataset
The classes are severely imbalanced, with r/repsneakers dominating. I will attempt to dig deeper into the sneakers and sneakermarket subreddits for more images.

This is likely because many people in r/repsneakers post for "quality control" (QC) to find the best replicas. As a result, there are more images there, and they show more detail.

In [137]:
df.drop_duplicates('url', inplace=True)         #keep the first occurrence of each url
df.reset_index(drop=True, inplace=True)         #renumber rows without keeping the old index

In [131]:
df['label'].value_counts()

sneakermarket     625
sneakers          498
sneakerscanada    176
sneakerhead        92
Name: label, dtype: int64

In [132]:
df['is_rep'].value_counts()

auth    1391
Name: is_rep, dtype: int64

#### Observation 2 of the dataset

1. There seem to be some duplicates in the data, which suggests there is a limit on the number of posts that can be scraped. I will scrape for new posts every day.
2. There is also a major class imbalance. I will look for more authentic sneaker subreddits.
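
One quick way to quantify the imbalance noted above is to look at class shares rather than raw counts, after duplicates are removed. The sketch below uses made-up rows, and `class_shares` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd


def class_shares(frame, column="is_rep"):
    """Return each class's share of rows after dropping duplicate URLs."""
    deduped = frame.drop_duplicates("url")
    return deduped[column].value_counts(normalize=True)


# toy data: one duplicated url, three unique 'auth' images and one 'rep' image
toy = pd.DataFrame({
    "url": ["a.jpg", "a.jpg", "b.jpg", "c.jpg", "d.jpg"],
    "is_rep": ["auth", "auth", "auth", "auth", "rep"],
})

shares = class_shares(toy)
print(shares)
```

Running the same helper on the scraped DataFrame after each day's scrape would show whether the additional subreddits are actually narrowing the gap.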

#### Download Images

In [5]:
#only for downloading images
df = pd.read_csv("./datasets/auth_url_1k.csv")

In [6]:
df.head()

| | url | label | is_rep |
|---|---|---|---|
| 0 | http://i.imgur.com/v92pz5I.jpg | sneakermarket | auth |
| 1 | http://i.imgur.com/VwpI9qg.jpg | sneakermarket | auth |
| 2 | http://i.imgur.com/OgT4CwF.jpg | sneakermarket | auth |
| 3 | http://i.imgur.com/RfeaaRK.jpg | sneakermarket | auth |
| 4 | http://i.imgur.com/veiqN5R.jpg | sneakermarket | auth |


In [8]:
import urllib.request

# Adding information about user agent
opener=urllib.request.build_opener()
opener.addheaders=[('User-Agent','Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1941.0 Safari/537.36')]
urllib.request.install_opener(opener)

for i in range(912,df.shape[0]):
    dl_url = df.loc[i,"url"]
    temp_label = df.loc[i,"is_rep"]
    dl_name = dl_url.split("/")[-1]

    #retrieve image and save in images folder
    print(f"downloading {i} of {df.shape[0]} images...")
    urllib.request.urlretrieve(dl_url,f"./datasets/images/{temp_label}/{dl_name}")
    
    
#credit: https://towardsdatascience.com/how-to-download-an-image-using-python-38a75cfa21c

downloading 912 of 1391 images...
downloading 913 of 1391 images...
downloading 914 of 1391 images...
downloading 915 of 1391 images...
downloading 916 of 1391 images...
downloading 917 of 1391 images...
downloading 918 of 1391 images...
downloading 919 of 1391 images...
downloading 920 of 1391 images...
downloading 921 of 1391 images...
downloading 922 of 1391 images...
downloading 923 of 1391 images...
downloading 924 of 1391 images...
downloading 925 of 1391 images...
downloading 926 of 1391 images...
downloading 927 of 1391 images...
downloading 928 of 1391 images...
downloading 929 of 1391 images...
downloading 930 of 1391 images...
downloading 931 of 1391 images...
downloading 932 of 1391 images...
downloading 933 of 1391 images...
downloading 934 of 1391 images...
downloading 935 of 1391 images...
downloading 936 of 1391 images...
downloading 937 of 1391 images...
downloading 938 of 1391 images...
downloading 939 of 1391 images...
downloading 940 of 1391 images...
...

downloading 1149 of 1391 images...
downloading 1150 of 1391 images...
downloading 1151 of 1391 images...
downloading 1152 of 1391 images...
downloading 1153 of 1391 images...
downloading 1154 of 1391 images...
downloading 1155 of 1391 images...
downloading 1156 of 1391 images...
downloading 1157 of 1391 images...
downloading 1158 of 1391 images...
downloading 1159 of 1391 images...
downloading 1160 of 1391 images...
downloading 1161 of 1391 images...
downloading 1162 of 1391 images...
downloading 1163 of 1391 images...
downloading 1164 of 1391 images...
downloading 1165 of 1391 images...
downloading 1166 of 1391 images...
downloading 1167 of 1391 images...
downloading 1168 of 1391 images...
downloading 1169 of 1391 images...
downloading 1170 of 1391 images...
downloading 1171 of 1391 images...
downloading 1172 of 1391 images...
downloading 1173 of 1391 images...
downloading 1174 of 1391 images...
downloading 1175 of 1391 images...
downloading 1176 of 1391 images...
...

downloading 1384 of 1391 images...
downloading 1385 of 1391 images...
downloading 1386 of 1391 images...
downloading 1387 of 1391 images...
downloading 1388 of 1391 images...
downloading 1389 of 1391 images...
downloading 1390 of 1391 images...
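
The hard-coded start index in the download loop suggests the run was resumed by hand. A possible alternative, sketched below, skips files that already exist and carries on when a single download fails. `download_missing` and its `fetch` parameter are hypothetical; `fetch` defaults to `urllib.request.urlretrieve`, the same call used in the loop.

```python
import os
import urllib.request


def download_missing(rows, root, fetch=urllib.request.urlretrieve):
    """Download each (url, label) pair into root/label/, skipping existing files.

    Returns the list of URLs that failed so they can be retried later.
    """
    failed = []
    for url, label in rows:
        folder = os.path.join(root, label)
        os.makedirs(folder, exist_ok=True)          # create the class folder if needed
        target = os.path.join(folder, url.split("/")[-1])
        if os.path.exists(target):
            continue                                # already downloaded in an earlier run
        try:
            fetch(url, target)
        except OSError:                             # urllib's URLError and HTTPError subclass OSError
            failed.append(url)
    return failed
```

With the DataFrame from this notebook, a call could look like `failed = download_missing(zip(df["url"], df["is_rep"]), "./datasets/images")`. Because existing files are skipped, the cell can be re-run after an interruption without tracking the row number.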


## Save Data to CSV

In [None]:
#stop here so that "Run All" does not continue to the save step
raise RuntimeError("stopping here to prevent Run All")

In [99]:
df.to_csv('./datasets/url.csv',index=False)
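
Passing `index=False` keeps the row index out of the file, so reading it back does not produce an extra `Unnamed: 0` column. A small round-trip sketch with in-memory buffers and made-up rows:

```python
import io

import pandas as pd

toy = pd.DataFrame({"url": ["a.jpg", "b.jpg"], "is_rep": ["auth", "rep"]})

with_index = io.StringIO()
toy.to_csv(with_index)                   # default: index written as an unnamed first column
without_index = io.StringIO()
toy.to_csv(without_index, index=False)   # index left out of the file

# rewind the buffers before reading them back
with_index.seek(0)
without_index.seek(0)

cols_with = list(pd.read_csv(with_index).columns)
cols_without = list(pd.read_csv(without_index).columns)
print(cols_with)
print(cols_without)
```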