Scraper only collects 118 byte files #49

Open
Pr0j3ct opened this issue Jul 22, 2024 · 16 comments
Pr0j3ct commented Jul 22, 2024

About two weeks ago the scraper started collecting only 118-byte files.

It does not appear to be IP-address related. Has the VSCO API changed?

@sideloading

Same issue here, see #48. I'm using https://github.com/mikf/gallery-dl, which is working fine.


Pr0j3ct commented Jul 25, 2024

One thing I noticed is that the sub-domain returns a 403:

i.vsco.co

but a URL of the form:

vsco.co/i

returns the image without a problem.

I'm no programmer, but when I have some free time I may try to refactor at least one of the modules to support that change and see what happens.
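As a rough illustration of that refactor idea, a hypothetical helper could rewrite the blocked sub-domain URLs into the working form. This is only a sketch; the assumption that the path carries over unchanged between i.vsco.co and vsco.co/i has not been confirmed against the VSCO API.

```python
from urllib.parse import urlparse


def rewrite_vsco_url(url):
    """Rewrite https://i.vsco.co/<path> (returns 403) into
    https://vsco.co/i/<path>, which still serves the image.

    Hypothetical sketch: assumes the path component is identical
    on both hosts.
    """
    parsed = urlparse(url)
    if parsed.netloc == "i.vsco.co":
        return f"https://vsco.co/i{parsed.path}"
    return url  # leave other URLs untouched
```

Any module that builds an i.vsco.co URL could pass it through a helper like this before downloading.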

@intothevoid33

@Pr0j3ct what do you mean?

I put a print statement into the script to see what it was trying to download. What it printed matched what I got when manually going to the gallery page, selecting an image, and then inspecting it.


parkerr82 commented Jul 26, 2024 via email


timbo0o1 commented Aug 1, 2024

Edit: it seems they block the default request headers used by the script.

You can simply set custom headers on your requests to get the images.

  1. Create a new entry in constants.py:
images = {
    'User-Agent': random.choice(user_agents),
    'Accept': 'image/avif,image/webp,image/png,image/svg+xml,image/*;q=0.8,*/*;q=0.5',
    'Accept-Language': 'de,en-US;q=0.7,en;q=0.3',
    'Connection': 'keep-alive',
    'Referer': 'https://vsco.co/',
    'Sec-Fetch-Dest': 'image',
    'Sec-Fetch-Mode': 'no-cors',
    'Sec-Fetch-Site': 'same-site',
    'Priority': 'u=4, i',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
}
  2. Use it in vscoscrape.py:
def download_img_normal(self, lists):
        if lists[2] is False:
            if f"{lists[1]}.jpg" in os.listdir():
                return True
            with open(f"{str(lists[1])}.jpg", "wb") as file:
                file.write(requests.get(lists[0], headers=constants.images, stream=True).content)
        else:
            if f"{lists[1]}.mp4" in os.listdir():
                return True
            with open(f"{str(lists[1])}.mp4", "wb") as file:
                for chunk in requests.get(lists[0],headers=constants.images, stream=True).iter_content(
                    chunk_size=1024
                ):
                    if chunk:
                        file.write(chunk)
        return True

Alternatively, you could use cloudscraper instead of Python requests.

pip install cloudscraper

import cloudscraper
class Scraper(object):
    def __init__(self, cache, latestCache):
        self.cache = cache
        self.latestCache = latestCache
        self.scraper = cloudscraper.create_scraper()
def download_img_journal(self, lists):
        """
        Downloads the journal media in specified ways depending on the type of media

        Since Journal items can be text files, images, or videos, I had to make 3
        different ways of downloading

        :params: lists - No idea why I named it this, but it's a media item
        :return: a boolean on whether the journal media was able to be downloaded
        """
        if lists[1] == "txt":
            with open(f"{str(lists[0])}.txt", "w") as file:
                file.write(lists[0])
        if lists[2] == "img":
            if f"{lists[1]}.jpg" in os.listdir():
                return True
            with open(f"{str(lists[1])}.jpg", "wb") as file:
                file.write(self.scraper.get(lists[0], stream=True).content)

        elif lists[2] == "vid":
            if f"{lists[1]}.mp4" in os.listdir():
                return True
            with open(f"{str(lists[1])}.mp4", "wb") as file:
                for chunk in self.scraper.get(lists[0], stream=True).iter_content(
                    chunk_size=1024
                ):
                    if chunk:
                        file.write(chunk)
        self.progbarj.update()
        return True
def download_img_normal(self, lists):
        """
        This function makes sense at least

        The if '%s.whatever' sections are to skip downloading the file again if it's already been downloaded

        At the time I wrote this, I only remember seeing that images and videos were the only things allowed

        So I didn't write an if statement checking for text files, so this would just skip it I believe if it ever came up
        and return True

        :params: lists - My naming sense was beat. lists is just a media item.
        :return: a boolean on whether the media item was downloaded successfully
        """
        if lists[2] is False:
            if f"{lists[1]}.jpg" in os.listdir():
                return True
            with open(f"{str(lists[1])}.jpg", "wb") as file:
                file.write(self.scraper.get(lists[0], stream=True).content)
        else:
            if f"{lists[1]}.mp4" in os.listdir():
                return True
            with open(f"{str(lists[1])}.mp4", "wb") as file:
                for chunk in self.scraper.get(lists[0], stream=True).iter_content(
                    chunk_size=1024
                ):
                    if chunk:
                        file.write(chunk)
        return True
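Whichever variant you use, a quick sanity check can confirm the fix worked: the blocked responses this issue is about are tiny (around 118 bytes), so a downloaded file that small is almost certainly the block page rather than real media. A minimal sketch follows; the 118-byte threshold comes from this thread, not from any VSCO documentation.

```python
def looks_blocked(content: bytes, threshold: int = 118) -> bool:
    """Heuristic: treat a response no larger than the ~118-byte
    block page reported in this issue as a failed download."""
    return len(content) <= threshold
```

You could call this on response.content after each download and retry or warn when it returns True.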

@intothevoid33

> Edit: Seems like they block the default request header which is used by the script. You could simply set a custom header to your requests to get the images. […]

That works perfectly, thank you!


spilla7 commented Aug 20, 2024

> Edit: Seems like they block the default request header which is used by the script.

Could someone please explain how to do this? I'd like to get this working again. I've tried gallery-dl but prefer vscoscraper.

@timbo0o1

> Could someone please explain how to do this? […]

I've already explained how to do this.
Where exactly do you need help?


spilla7 commented Aug 24, 2024

> I've already explained how to do this. Where exactly do you need help?

I can see where to add the text in the constants.py file, but I'm not sure where to add the text in the vscoscrape.py file.

I've tried adding it at the end, but I get an error message when I run the script.

Cheers

@AxelConceicao

> I can see where to add the text in the constants.py file, but I'm not sure where to add it in the vscoscrape.py file. […]

There's nothing to replace in constants.py; just add the images dict,
and add headers=constants.images like he did in the download_img_normal function.


billyklubb commented Aug 28, 2024

> Edit: Seems like they block the default request header which is used by the script. You could simply set a custom header to your requests to get the images. […] Alternatively you could use cloudscraper instead of the python requests. […]

Hey, so I am not a programmer in the least. The two files you are referring to, constants.py and vscoscrape.py: where are those located, and where are the new entries supposed to go in those files? Any help is sincerely appreciated!

Edit: when I look through the git repo for vsco-scraper I see the two files you are talking about, but I am not sure what I am supposed to do with them. I installed vsco-scraper with pip, so do I need to edit the source and perform a build/compile or something along those lines? Forgive me, I only know that vsco-scraper is in the bin folder of my Linux profile; after that I have zero idea what to do... =(


timbo0o1 commented Aug 28, 2024

> Hey, so I am not a programmer in the least, the first two files you are referring to constants.py and vscoscrape.py, where are those located? […]

If you installed vscoscrape with pip, the files are located in your Python installation.
Edit: to locate a pip package you can use the command "pip show vsco-scraper",
for example C:\Python310\Lib\site-packages\vscoscrape.
You will find both files there (constants.py / vscoscrape.py).
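If pip isn't handy on your PATH, the same directory can be found from Python itself. A small sketch; the package name vscoscrape is assumed to match what the pip package installs.

```python
import importlib.util
import os


def package_dir(name):
    """Return the directory of an installed package, or None if missing."""
    spec = importlib.util.find_spec(name)
    if spec and spec.origin:
        return os.path.dirname(spec.origin)
    return None


# e.g. package_dir("vscoscrape") should give the folder that
# contains constants.py and vscoscrape.py
```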

No need to build from source. Just use the pip package and do the following.
Open constants.py in your text editor and paste this at the end of the file:

images = {
    'User-Agent': random.choice(user_agents),
    'Accept': 'image/avif,image/webp,image/png,image/svg+xml,image/*;q=0.8,*/*;q=0.5',
    'Accept-Language': 'de,en-US;q=0.7,en;q=0.3',
    'Connection': 'keep-alive',
    'Referer': 'https://vsco.co/',
    'Sec-Fetch-Dest': 'image',
    'Sec-Fetch-Mode': 'no-cors',
    'Sec-Fetch-Site': 'same-site',
    'Priority': 'u=4, i',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
}

Now open vscoscrape.py and search for download_img_normal.
Select the whole function (down to its final "return True"),
then replace it with my version:

def download_img_normal(self, lists):
       
        if lists[2] is False:
            if f"{lists[1]}.jpg" in os.listdir():
                return True
            with open(f"{str(lists[1])}.jpg", "wb") as file:
                file.write(requests.get(lists[0], headers=constants.images, stream=True).content)
        else:
            if f"{lists[1]}.mp4" in os.listdir():
                return True
            with open(f"{str(lists[1])}.mp4", "wb") as file:
                for chunk in requests.get(lists[0],headers=constants.images, stream=True).iter_content(
                    chunk_size=1024
                ):
                    if chunk:
                        file.write(chunk)
        return True


billyklubb commented Aug 28, 2024

> If you installed vscoscrape with pip the files are located in your Python installation. […]

Thank you very much!! Those changes were easy enough. My first attempt gave me an indentation error; I just needed to move the "def download_img_normal(self, lists):" line over a tab stop to line up with all the others, and it ran without issue! I really appreciate your time! =)

Edit: I tested it for journals; it still produces the 118-byte files. I tried to sort it out, but the code block for journals is very different...

Edit: I figured it out. I looked for the function that downloads journals and added "headers=constants.images" to the jpg and mp4 lines, and it worked like a charm!

I'm certainly not a Python programmer now... lol, but reading through your code I see that constants.images must refer to the constants.py file, and the .images must refer to the images entry you had me add! Thanks for helping me see it! =)


bebunw commented Oct 1, 2024

Thanks very much @timbo0o1. I know there is gallery-dl, but it doesn't keep the same original filenames, and it was a pain for updating an old folder.


birizui commented Oct 25, 2024

> If you installed vscoscrape with pip the files are located in your Python installation. […]

Hey, thanks for the previous help. Unfortunately the script doesn't work again: when I run it, it shows '... crashed' for every username in my txt file. Please take a look... thank you.

@timbo0o1

> hey, thanks for the previous help. unfortunately the script doesn't work again. […]

Maybe take a look at #50.
