# Final project IronHack

For the final project I decided to create a book recommendation system: you enter a book's title and it returns two other books that the system thinks you would like to read.

<br>
Dataset used: https://www.kaggle.com/jealousleopard/goodreadsbooks


## imports

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import requests
from bs4 import BeautifulSoup
import re
from jupyterthemes import jtplot
jtplot.style(figsize=(18,11))

In [2]:
data = pd.read_csv(r"books.csv")
data.head()


Unnamed: 0,bookID,title,authors,average_rating,isbn,isbn13,language_code,# num_pages,ratings_count,text_reviews_count;;;
0,1,Harry Potter and the Half-Blood Prince (Harry ...,J.K. Rowling-Mary GrandPré,4.56,0439785960,9780440000000.0,eng,652.0,1944099.0,26249;;;
1,2,Harry Potter and the Order of the Phoenix (Har...,J.K. Rowling-Mary GrandPré,4.49,0439358078,9780439000000.0,eng,870.0,1996446.0,27613;;;
2,3,Harry Potter and the Sorcerer's Stone (Harry P...,J.K. Rowling-Mary GrandPré,4.47,0439554934,9780440000000.0,eng,320.0,5629932.0,70390;;;
3,4,Harry Potter and the Chamber of Secrets (Harry...,J.K. Rowling,4.41,0439554896,9780440000000.0,eng,352.0,6267.0,272;;;
4,5,Harry Potter and the Prisoner of Azkaban (Harr...,J.K. Rowling-Mary GrandPré,4.55,043965548X,9780440000000.0,eng,435.0,2149872.0,33964;;;


## Cleaning dataset
Starting with the column types and looking for NaN values.

In [3]:
data.dtypes

bookID                    object
title                     object
authors                   object
average_rating           float64
isbn                      object
isbn13                   float64
language_code             object
# num_pages              float64
ratings_count            float64
text_reviews_count;;;     object
dtype: object

In [4]:
data.isna().sum()

bookID                    0
title                    28
authors                  28
average_rating           28
isbn                     28
isbn13                   28
language_code            28
# num_pages              28
ratings_count            28
text_reviews_count;;;    28
dtype: int64

In [5]:
data.shape

(13719, 10)

In [6]:
data = data.dropna(axis=0, how="any")

In [7]:
data.isna().sum()

bookID                   0
title                    0
authors                  0
average_rating           0
isbn                     0
isbn13                   0
language_code            0
# num_pages              0
ratings_count            0
text_reviews_count;;;    0
dtype: int64

In [8]:
data.shape

(13691, 10)

In [9]:
data = data.rename(columns={"text_reviews_count;;;":"text_reviews_count", "# num_pages":"num_pages"})

In [10]:
data["text_reviews_count"] = data.text_reviews_count.str.replace(";;;","")

In [11]:
data["text_reviews_count"] = data["text_reviews_count"].str.split(";").str[0].astype(int)
data["num_pages"] = data["num_pages"].astype(int)
data["ratings_count"] = data["ratings_count"].astype(int)
data["isbn13"] = data["isbn13"].astype(int)

In [12]:
data.dtypes

bookID                 object
title                  object
authors                object
average_rating        float64
isbn                   object
isbn13                  int64
language_code          object
num_pages               int64
ratings_count           int64
text_reviews_count      int64
dtype: object
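The cleaning steps above can also be tried on a tiny in-memory CSV. This is only a sketch: the two rows and the name `toy` are made up for illustration, and reading `isbn13` as text with `dtype` is an alternative to the float-to-int conversion used in this notebook, not what the notebook does.

```python
import io
import pandas as pd

# Made-up sample that mimics the trailing ";;;" and the "# num_pages" header.
raw = io.StringIO(
    "bookID,title,isbn13,# num_pages,text_reviews_count;;;\n"
    "1,Book A,9780439785969,652,26249;;;\n"
    "2,Book B,9780439358071,870,27613;;;\n"
)

# Reading isbn13 as text keeps every digit and skips the float round trip.
toy = pd.read_csv(raw, dtype={"isbn13": str})

# Clean the header names: drop the "# " prefix and the trailing semicolons.
toy.columns = toy.columns.str.replace("# ", "", regex=False).str.rstrip(";")

# Clean the values of the last column and convert them to integers.
toy["text_reviews_count"] = toy["text_reviews_count"].str.rstrip(";").astype(int)

print(toy.dtypes)
```

Keeping `isbn13` as text from the start also means it already has the type needed for the merges later on.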

## Web Scraping
The web scraping took 5-10 hours to run, so only its results are loaded here (datasets "a" and "b"), and I am going to merge them with the original dataset, "data".

<br>
PS: the code for the web scraping can be found at the bottom of this notebook.

"a" is the output of the first spider; it could not find every value, so it returned 0 for the missing ones.
<br>
"b" holds what the first spider missed, scraped from another site.

<br>
Importing, transposing and renaming the datasets.

In [13]:
a = pd.read_csv(r"dfWebMenor.csv").transpose().reset_index()
b = pd.read_csv(r"dfWeb2Menor.csv").transpose().reset_index()

In [14]:
a.columns = ["isbn13", "year"]
b.columns = ["isbn13", "year"]
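The transpose is needed because the scraped dictionary was saved as a single wide row, with one column per ISBN. Here is a small sketch of that round trip, using a made-up two-entry dictionary and an in-memory buffer instead of a real file (the names `scraped`, `wide`, `buffer` and `restored` are only for illustration):

```python
import io
import pandas as pd

# Made-up stand-in for the scraped {isbn13: year} dictionary.
scraped = {"9780439785969": 2005, "9780439358071": 2004}

# Same shape as the export at the bottom of the notebook: one column per ISBN.
wide = pd.DataFrame(scraped, index=range(len(scraped)))
buffer = io.StringIO()
wide.head(1).to_csv(buffer, index=False)
buffer.seek(0)

# Reading it back gives one wide row, so transpose to get one row per ISBN.
restored = pd.read_csv(buffer).transpose().reset_index()
restored.columns = ["isbn13", "year"]
print(restored)
```

The ISBNs come back as strings because they were column headers in the CSV, which is why `isbn13` shows up as `object` in these datasets.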

In [15]:
a

Unnamed: 0,isbn13,year
0,9780439785969,2005
1,9780439358071,2004
2,9780439554930,0
3,9780439554893,2003
4,9780439655484,2004
...,...,...
13686,9780061186424,2007
13687,9780930289553,1990
13688,9780061238963,2007
13689,9780743201117,2000


In [16]:
b

Unnamed: 0,isbn13,year
0,9780439554930,2003
1,9780767915069,2002
2,9780618346240,2003
3,9780618510825,2004
4,9780670059676,2006
...,...,...
1463,9788401335839,2006
1464,9782842281540,2002
1465,9780439856263,2006
1466,9780007137336,2004


Checking the values.

In [17]:
a.describe()

Unnamed: 0,year
count,13691.0
mean,1980.913228
std,3946.474944
min,0.0
25%,1990.0
50%,2001.0
75%,2005.0
max,301748.0


In [18]:
b.describe()

Unnamed: 0,year
count,1468.0
mean,1917.635559
std,396.12728
min,0.0
25%,1996.0
50%,2002.0
75%,2005.0
max,2016.0


The regex for the first spider, "a", had the wrong digit limit, 7 instead of 4 (that is why `max` = 301748). I will filter out the wrong values and scrape them again.

In [19]:
mask = a["year"] > 2020
scrapAgain = a[mask]["isbn13"]
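To see why the digit limit matters, here is a small sketch of the 4-digit pattern used in the scraping loops, run on a made-up string. The helper `first_plausible_year` and its bounds (1400 and 2020) are assumptions for illustration and are not part of this notebook; the upper bound just mirrors the `2020` used in the mask above.

```python
import re

# Exactly four digits, with no digit right before or right after.
YEAR_PATTERN = re.compile(r"(?<!\d)\d{4}(?!\d)")

def first_plausible_year(text, earliest=1400, latest=2020):
    """Return the first standalone 4-digit number inside the range, or 0."""
    for token in YEAR_PATTERN.findall(text):
        year = int(token)
        if earliest <= year <= latest:
            return year
    return 0

# Made-up page text: a 13-digit ISBN, a 4-digit non-year, a year, a 6-digit number.
sample = "ISBN13 9780439785969 Pages 3551 Published 2005 Weight 301748"

print(YEAR_PATTERN.findall(sample))   # standalone 4-digit numbers only
print(first_plausible_year(sample))
```

The lookarounds stop the pattern from cutting a year out of a longer number, and the range check skips 4-digit values that are not years.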

`year3` is commented out so that an accidental click does not reset it.

In [20]:
# year3 = dict()

In [21]:
for x in scrapAgain:
    if x not in year3:
        url = "https://isbndb.com/book/"
        response = requests.get(url+str(x))
        print(response)
        # confirm the response
        # naming the parser avoids BeautifulSoup's "no parser specified" warning
        soup = BeautifulSoup(response.content, "html.parser")
        web = [t.text for t in soup.find_all("table", {"class":"table table-hover table-responsive" })]

        # standalone 4-digit numbers found in the table text
        year_pub = re.findall(r"(?<!\d)\d{4}(?!\d)", str(web))
        try:
            year_pub = int(year_pub[0])
        except IndexError:
            # nothing matched, store 0 to mark the year as missing
            year_pub = 0
            print("no year")
        year3[x] = year_pub

<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
no year
...

New dataset to add to the original one.

In [580]:
# one row per scraped ISBN, built directly from the dictionary items
c = pd.DataFrame(list(year3.items()), columns=["isbn13", "year"])
c

Unnamed: 0,isbn13,year
0,9780679431329,1994
1,9780814326114,2099
2,9780824512590,2002
3,9780440229353,0
4,9780486404882,1999
...,...,...
93,9780767922081,2006
94,9781879960749,3551
95,9780375760846,2006
96,9780151007158,6003


In [23]:
c.describe()

Unnamed: 0,year
count,98.0
mean,2738.673469
std,1915.737363
min,0.0
25%,1999.0
50%,2006.0
75%,3555.25
max,9901.0


Some values are still wrong (years after 2020), so I am going to filter them out. The remaining 0's are dropped later.

In [24]:
mask = c["year"] < 2020
c = c[mask]

In [25]:
c.describe()

Unnamed: 0,year
count,56.0
mean,1606.660714
std,801.588815
min,0.0
25%,1985.5
50%,2000.0
75%,2005.0
max,2010.0


In [27]:
data.dtypes

bookID                 object
title                  object
authors                object
average_rating        float64
isbn                   object
isbn13                  int64
language_code          object
num_pages               int64
ratings_count           int64
text_reviews_count      int64
dtype: object

In [28]:
a.dtypes

isbn13    object
year       int64
dtype: object

Making `isbn13` the same type in all datasets and resetting the index.

In [29]:
data["isbn13"] = data["isbn13"].astype(str)
data = data.reset_index()

Now merge all datasets together.

In [30]:
data = data.merge(a, on="isbn13", how="left")
data = data.merge(b, on="isbn13", how="left")
data = data.merge(c, on="isbn13", how="left")
data

Unnamed: 0,index,bookID,title,authors,average_rating,isbn,isbn13,language_code,num_pages,ratings_count,text_reviews_count,year_x,year_y,year
0,0,1,Harry Potter and the Half-Blood Prince (Harry ...,J.K. Rowling-Mary GrandPré,4.56,0439785960,9780439785969,eng,652,1944099,26249,2005,,
1,1,2,Harry Potter and the Order of the Phoenix (Har...,J.K. Rowling-Mary GrandPré,4.49,0439358078,9780439358071,eng,870,1996446,27613,2004,,
2,2,3,Harry Potter and the Sorcerer's Stone (Harry P...,J.K. Rowling-Mary GrandPré,4.47,0439554934,9780439554930,eng,320,5629932,70390,0,2003.0,
3,3,4,Harry Potter and the Chamber of Secrets (Harry...,J.K. Rowling,4.41,0439554896,9780439554893,eng,352,6267,272,2003,,
4,4,5,Harry Potter and the Prisoner of Azkaban (Harr...,J.K. Rowling-Mary GrandPré,4.55,043965548X,9780439655484,eng,435,2149872,33964,2004,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13686,13714,47699,M Is for Magic,Neil Gaiman-Teddy Kristiansen,3.82,0061186422,9780061186424,eng,260,11317,1060,2007,,
13687,13715,47700,Black Orchid,Neil Gaiman-Dave McKean,3.72,0930289552,9780930289553,eng,160,8710,361,1990,,
13688,13716,47701,InterWorld (InterWorld #1),Neil Gaiman-Michael Reaves,3.53,0061238961,9780061238963,en-US,239,14334,1485,2007,,
13689,13717,47708,The Faeries' Oracle,Brian Froud-Jessica Macbeth,4.43,0743201116,9780743201117,eng,224,1550,38,2000,,


Combining the three year columns into a single one.

In [32]:
# Where the first spider returned 0, use the second spider's year,
# falling back to the re-scraped year when the second spider has none.
# Assigning through .loc avoids the chained indexing that raises SettingWithCopyWarning.
missing = data["year_x"] == 0
fallback = data["year_y"].fillna(data["year"]).fillna(0).astype(int)
data.loc[missing, "year_x"] = fallback[missing]


In [33]:
data

Unnamed: 0,index,bookID,title,authors,average_rating,isbn,isbn13,language_code,num_pages,ratings_count,text_reviews_count,year_x,year_y,year
0,0,1,Harry Potter and the Half-Blood Prince (Harry ...,J.K. Rowling-Mary GrandPré,4.56,0439785960,9780439785969,eng,652,1944099,26249,2005,,
1,1,2,Harry Potter and the Order of the Phoenix (Har...,J.K. Rowling-Mary GrandPré,4.49,0439358078,9780439358071,eng,870,1996446,27613,2004,,
2,2,3,Harry Potter and the Sorcerer's Stone (Harry P...,J.K. Rowling-Mary GrandPré,4.47,0439554934,9780439554930,eng,320,5629932,70390,2003,2003.0,
3,3,4,Harry Potter and the Chamber of Secrets (Harry...,J.K. Rowling,4.41,0439554896,9780439554893,eng,352,6267,272,2003,,
4,4,5,Harry Potter and the Prisoner of Azkaban (Harr...,J.K. Rowling-Mary GrandPré,4.55,043965548X,9780439655484,eng,435,2149872,33964,2004,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13686,13714,47699,M Is for Magic,Neil Gaiman-Teddy Kristiansen,3.82,0061186422,9780061186424,eng,260,11317,1060,2007,,
13687,13715,47700,Black Orchid,Neil Gaiman-Dave McKean,3.72,0930289552,9780930289553,eng,160,8710,361,1990,,
13688,13716,47701,InterWorld (InterWorld #1),Neil Gaiman-Michael Reaves,3.53,0061238961,9780061238963,en-US,239,14334,1485,2007,,
13689,13717,47708,The Faeries' Oracle,Brian Froud-Jessica Macbeth,4.43,0743201116,9780743201117,eng,224,1550,38,2000,,
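The `year_x` / `year_y` names come from pandas adding suffixes when both sides of a merge have a `year` column. A minimal sketch of the same idea on made-up data, with explicit suffixes and a vectorised fill (the frames `books`, `first`, `second` and the suffix names are assumptions for illustration):

```python
import pandas as pd

# Made-up stand-ins for "data", "a" and "b".
books = pd.DataFrame({"isbn13": ["111", "222", "333"]})
first = pd.DataFrame({"isbn13": ["111", "222", "333"], "year": [2005, 0, 0]})
second = pd.DataFrame({"isbn13": ["222", "333"], "year": [2003, 0]})

# The first merge adds "year"; the second merge would clash, so suffixes are applied.
merged = books.merge(first, on="isbn13", how="left").merge(
    second, on="isbn13", how="left", suffixes=("_first", "_second")
)

# Keep the first spider's year unless it is 0, then take the second spider's.
year = merged["year_first"].where(merged["year_first"] != 0, merged["year_second"])
merged["year"] = year.fillna(0).astype(int)

print(merged[["isbn13", "year"]])
```

Rows that are still 0 after this step are the ones neither source could resolve.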


The rows where `year_x` is still 0 are going to be dropped, as I said earlier. I am also dropping the columns `year_y`, `year` and `index`, and renaming `year_x` to `year`.

In [35]:
sum(data["year_x"] == 0)

60

In [96]:
try:
    data = data.drop(columns=["year_y","year", "index"])
    data = data.rename(columns={"year_x":"year"})
    data = data.drop(data[data["year"] == 0].index)
    data = data.reset_index().drop(columns="index")
except:
    pass

In [97]:
data

Unnamed: 0,bookID,title,authors,average_rating,isbn,isbn13,language_code,num_pages,ratings_count,text_reviews_count,year
0,1,Harry Potter and the Half-Blood Prince (Harry ...,J.K. Rowling-Mary GrandPré,4.56,0439785960,9780439785969,eng,652,1944099,26249,2005
1,2,Harry Potter and the Order of the Phoenix (Har...,J.K. Rowling-Mary GrandPré,4.49,0439358078,9780439358071,eng,870,1996446,27613,2004
2,3,Harry Potter and the Sorcerer's Stone (Harry P...,J.K. Rowling-Mary GrandPré,4.47,0439554934,9780439554930,eng,320,5629932,70390,2003
3,4,Harry Potter and the Chamber of Secrets (Harry...,J.K. Rowling,4.41,0439554896,9780439554893,eng,352,6267,272,2003
4,5,Harry Potter and the Prisoner of Azkaban (Harr...,J.K. Rowling-Mary GrandPré,4.55,043965548X,9780439655484,eng,435,2149872,33964,2004
...,...,...,...,...,...,...,...,...,...,...,...
13626,47699,M Is for Magic,Neil Gaiman-Teddy Kristiansen,3.82,0061186422,9780061186424,eng,260,11317,1060,2007
13627,47700,Black Orchid,Neil Gaiman-Dave McKean,3.72,0930289552,9780930289553,eng,160,8710,361,1990
13628,47701,InterWorld (InterWorld #1),Neil Gaiman-Michael Reaves,3.53,0061238961,9780061238963,en-US,239,14334,1485,2007
13629,47708,The Faeries' Oracle,Brian Froud-Jessica Macbeth,4.43,0743201116,9780743201117,eng,224,1550,38,2000


In [43]:
data.dtypes

bookID                 object
title                  object
authors                object
average_rating        float64
isbn                   object
isbn13                 object
language_code          object
num_pages               int64
ratings_count           int64
text_reviews_count      int64
year                    int64
dtype: object

## Model

I am going to use KDTree to calculate the distance between the books and recommend the 2 that are closest.

In [575]:
from sklearn.neighbors import KDTree
from sklearn.preprocessing import StandardScaler
X = data[["average_rating", "num_pages", "ratings_count"]] 
scaler = StandardScaler()
X = scaler.fit_transform(X)
tree = KDTree(X)             
dist, ind = tree.query(X, k=3)

ind = pd.DataFrame(ind)
dist = pd.DataFrame(dist)
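The query uses `k=3` because each book is also part of the tree, so its nearest neighbour is normally the book itself at distance 0. A small sketch on made-up numbers (the `features` array is invented, with columns standing for rating, pages and ratings count):

```python
import numpy as np
from sklearn.neighbors import KDTree
from sklearn.preprocessing import StandardScaler

# Made-up rows: average_rating, num_pages, ratings_count.
features = np.array([
    [4.5, 650, 2_000_000],
    [4.4, 870, 2_100_000],
    [3.8, 260, 11_000],
    [3.7, 160, 9_000],
    [3.5, 240, 14_000],
], dtype=float)

# Scale first so ratings_count does not dominate the distance.
scaled = StandardScaler().fit_transform(features)

tree = KDTree(scaled)
dist, ind = tree.query(scaled, k=3)

print(ind)    # column 0 is each row's own position
print(dist)   # column 0 is 0.0, the distance to itself
```

This is why column 0 is dropped afterwards and only columns 1 and 2 are kept as recommendations. If two books had exactly the same three values, column 0 could point at the twin instead, so this holds only when the rows are distinct.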

Organizing the 2 new datasets: "ind", with the indexes used to look up the recommendations, and "dist", with the distance between the books.

In [576]:
dist = dist.drop(columns=0)
dist = dist.rename(columns={1:"recomend1_Dist", 2:"recomend2_Dist"})
dist["title"] = data["title"]
dist

Unnamed: 0,recomend1_Dist,recomend2_Dist,title
0,0.999166,1.302397,Harry Potter and the Half-Blood Prince (Harry ...
1,0.999166,1.914767,Harry Potter and the Order of the Phoenix (Har...
2,11.437260,28.826986,Harry Potter and the Sorcerer's Stone (Harry P...
3,0.032068,0.037348,Harry Potter and the Chamber of Secrets (Harry...
4,0.620371,1.366191,Harry Potter and the Prisoner of Azkaban (Harr...
...,...,...,...
13626,0.044012,0.059623,M Is for Magic
13627,0.030229,0.046654,Black Orchid
13628,0.049635,0.058639,InterWorld (InterWorld #1)
13629,0.030062,0.031079,The Faeries' Oracle


In [None]:
def recommendation(string):
    """
    This function takes the book's title (string) and prints 2 recommendations.
    """
    # uppercase both the input and the list of titles so the match ignores case.
    string = string.upper()

    # sorting by ratings_count to later take only the top 5 matches.
    all_books_names = list(data.sort_values("ratings_count", ascending=False)["title"].astype(str).str.upper())

    # since the list above is sorted, I need another list to have access to the original indexes.
    all_books_in = list(data["title"].astype(str).str.upper())
    listIndex = []
    i = 0
    
    for name in all_books_names:
        if string in name:
            if i < 5:
                i += 1
                listIndex.append(all_books_in.index(name))
                print(name,all_books_in.index(name))   
    
    if i == 0: # no matches
        print('\n "'+string+'"', "IS NOT ON OUR DATASET. PLEASE TRY ANOTHER BOOK.")
        recommendation(input("TYPE A BOOK YOU LIKE: \n"))
        
    elif i == 1: # only 1 match so it returns the recommendation
        print("\n\nOUR RECOMMENDATIONS ARE:")
        print("FIRST -", data["title"][ind["recomend1"][listIndex[0]]])
        print("SECOND -", data["title"][ind["recomend2"][listIndex[0]]])
        
    else: # more than 1 match, it needs another interaction with the user
        number = int(input("\nCHOOSE THE BOOK BY TYPING THE NUMBER THAT APPEAR ON THE RIGHT SIDE. "))
        print("\n\nOUR RECOMMENDATIONS ARE:")
        print("FIRST -", data["title"][ind["recomend1"][number]])
        print("SECOND -", data["title"][ind["recomend2"][number]])
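The title search inside the function can also be written without the two parallel lists. This is only a sketch of an alternative, not the function used in this notebook: `find_titles`, `books`, `query` and `limit` are hypothetical names, and the toy frame is made up.

```python
import pandas as pd

def find_titles(books, query, limit=5):
    """Return the index labels of the most-rated titles containing the query."""
    # case=False ignores case; regex=False treats the query as plain text.
    matches = books[books["title"].str.contains(query, case=False, regex=False)]
    top = matches.sort_values("ratings_count", ascending=False).head(limit)
    return top.index.tolist()

# Made-up data with the same two columns the function relies on.
toy = pd.DataFrame({
    "title": [
        "Harry Potter and the Chamber of Secrets",
        "Harry Potter and the Sorcerer's Stone",
        "The Hobbit",
    ],
    "ratings_count": [6267, 5629932, 2000000],
})

print(find_titles(toy, "harry potter"))
print(find_titles(toy, "lalala"))
```

Because the result keeps the DataFrame's own index labels, the numbers can be used directly with `ind`, and duplicated titles do not all resolve to the first occurrence as they do with `list.index`.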

In [599]:
recommendation(input("type a book:  "))

type a book:  lalala bababab

 "LALALA BABABAB" IS NOT ON OUR DATASET. PLEASE TRY ANOTHER BOOK.
TYPE A BOOK YOU LIKE: 
harry potter
HARRY POTTER AND THE SORCERER'S STONE (HARRY POTTER  #1) 2
HARRY POTTER AND THE PRISONER OF AZKABAN (HARRY POTTER  #3) 4
HARRY POTTER AND THE CHAMBER OF SECRETS (HARRY POTTER  #2) 3
HARRY POTTER AND THE ORDER OF THE PHOENIX (HARRY POTTER  #5) 1
HARRY POTTER AND THE HALF-BLOOD PRINCE (HARRY POTTER  #6) 0

CHOOSE THE BOOK BY TYPING THE NUMBER THAT APPEAR ON THE RIGHT SIDE. 2


OUR RECOMMENDATION ARE:
FIRST - Twilight (Twilight  #1)
SECOND - The Hobbit or There and Back Again


In [600]:
recommendation(input("type a book:  "))

type a book:  death note
DEATH NOTE  VOL. 1: BOREDOM (DEATH NOTE  #1) 4497
DEATH NOTE  VOL. 2: CONFLUENCE (DEATH NOTE  #2) 4501
DEATH NOTE  VOL. 3: HARD RUN (DEATH NOTE  #3) 4500
DEATH NOTE  VOL. 4: LOVE (DEATH NOTE  #4) 4498
DEATH NOTE  VOL. 5: WHITEOUT (DEATH NOTE  #5) 4499

CHOOSE THE BOOK BY TYPING THE NUMBER THAT APPEAR ON THE RIGHT SIDE. 4501


OUR RECOMMENDATION ARE:
FIRST - Death Note  Vol. 3: Hard Run (Death Note  #3)
SECOND - Bleach  Volume 15


In [601]:
recommendation(input("type a book:  "))

type a book:  The World of The Dark Crystal
THE WORLD OF THE DARK CRYSTAL 13630


OUR RECOMMENDATION ARE:
FIRST - Happy Times in Noisy Village
SECOND - The Children of Noisy Village


In [577]:
ind = ind.drop(columns=0)
ind = ind.rename(columns={1:"recomend1", 2:"recomend2"})
ind["title"] = data["title"]
ind

Unnamed: 0,recomend1,recomend2,title
0,1,24,Harry Potter and the Half-Blood Prince (Harry ...
1,0,24,Harry Potter and the Order of the Phoenix (Har...
2,12167,1981,Harry Potter and the Sorcerer's Stone (Harry P...
3,3682,3690,Harry Potter and the Chamber of Secrets (Harry...
4,5264,24,Harry Potter and the Prisoner of Azkaban (Harr...
...,...,...,...
13626,1257,13246,M Is for Magic
13627,13033,8334,Black Orchid
13628,11137,7026,InterWorld (InterWorld #1)
13629,10777,7975,The Faeries' Oracle


## Web scraping code

In [14]:
data

Unnamed: 0,bookID,title,authors,average_rating,isbn,isbn13,language_code,num_pages,ratings_count,text_reviews_count
0,1,Harry Potter and the Half-Blood Prince (Harry ...,J.K. Rowling-Mary GrandPré,4.56,0439785960,9780439785969,eng,652,1944099,26249
1,2,Harry Potter and the Order of the Phoenix (Har...,J.K. Rowling-Mary GrandPré,4.49,0439358078,9780439358071,eng,870,1996446,27613
2,3,Harry Potter and the Sorcerer's Stone (Harry P...,J.K. Rowling-Mary GrandPré,4.47,0439554934,9780439554930,eng,320,5629932,70390
3,4,Harry Potter and the Chamber of Secrets (Harry...,J.K. Rowling,4.41,0439554896,9780439554893,eng,352,6267,272
4,5,Harry Potter and the Prisoner of Azkaban (Harr...,J.K. Rowling-Mary GrandPré,4.55,043965548X,9780439655484,eng,435,2149872,33964
...,...,...,...,...,...,...,...,...,...,...
13714,47699,M Is for Magic,Neil Gaiman-Teddy Kristiansen,3.82,0061186422,9780061186424,eng,260,11317,1060
13715,47700,Black Orchid,Neil Gaiman-Dave McKean,3.72,0930289552,9780930289553,eng,160,8710,361
13716,47701,InterWorld (InterWorld #1),Neil Gaiman-Michael Reaves,3.53,0061238961,9780061238963,en-US,239,14334,1485
13717,47708,The Faeries' Oracle,Brian Froud-Jessica Macbeth,4.43,0743201116,9780743201117,eng,224,1550,38


Creating a dictionary with `isbn13` and `year` so we can merge them later.

Again, it took 5-10 hours for this spider to run, so I will comment out the line that initializes the variable.

In [None]:
# year = dict()

In [70]:
for x in data["isbn13"]:
    if x not in year:
        url = "https://isbndb.com/book/"
        response = requests.get(url+str(x))
        print(response)
        # confirm the response
        # naming the parser avoids BeautifulSoup's "no parser specified" warning
        soup = BeautifulSoup(response.content, "html.parser")
        web = [t.text for t in soup.find_all("table", {"class":"table table-hover table-responsive" })]

        # standalone 4-digit numbers found in the table text
        year_pub = re.findall(r"(?<!\d)\d{4}(?!\d)", str(web))
        try:
            year_pub = int(year_pub[0])
        except IndexError:
            # nothing matched, store 0 to mark the year as missing
            year_pub = 0
            print("no year")
        year[x] = year_pub

<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
no year
...

In [127]:
len(year)


13691

Creating a list to run the second spider on. Remember that the first one returned 0 when it couldn't find the year.

In [120]:
index = []
for x in year:
    if year[x] == 0:
        index.append(x)  

In [None]:
# year2 = dict()

In [153]:
for x in index:
    if x not in year2:
        url = "https://www.justbooks.co.uk/search/?isbn="
        url2 = "&mode=isbn&st=sr&ac=qr"
        response = requests.get(url+str(x)+url2)
        print(response)

        # naming the parser avoids BeautifulSoup's "no parser specified" warning
        soup = BeautifulSoup(response.content, "html.parser")
        web = [t.text for t in soup.find_all("span", {"class":"describe-isbn" })]

        # standalone 4-digit numbers found in the description text
        year_pub = re.findall(r"(?<!\d)\d{4}(?!\d)", str(web))
        try:
            year_pub = int(year_pub[0])
        except IndexError:
            # nothing matched, store 0 to mark the year as missing
            year_pub = 0
            print("error")
        year2[x] = year_pub

<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
error
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>


In [154]:
len(year2)

1468

In [152]:
len(index)

1468

`index2` holds what is missing from both spiders. Since it is only 60 books, I dropped them.

In [155]:
index2 = []
for x in year2:
    if year2[x] == 0:
        index2.append(x)

As I said, I exported the scraped results to CSV and then imported them again, so they would be saved.

In [204]:
dfWeb2 = pd.DataFrame(year2, index=range(len(year2)))

In [170]:
dfWeb2.head(1).to_csv(r"dfWeb2MenorRow.csv", index=False)

In [166]:
dfWeb = pd.DataFrame(year, index=range(len(year)))

In [168]:
dfWeb.head(1).to_csv(r"dfWebMenor.csv", index=False)

#### For later



Use these sites to get "About the book".


https://www.bookfinder.com/search/?author=&title=&lang=en&new_used=*&destination=br&currency=BRL&binding=*&isbn=9780439358071&keywords=&minprice=&maxprice=&publisher=&min_year=&max_year=&mode=advanced&st=sr&ac=qr
    
<br>
    
https://www.justbooks.co.uk/search/?author=&title=&lang=en&isbn=9780439358071&new_used=*&destination=br&currency=BRL&mode=basic&st=sr&ac=qr

Use this site to indicate places to buy the recommended book.

http://www.bookfinder4u.com/IsbnSearch.aspx?isbn=9780439554930&mode=direct